Research Note: Cadmus Pipeline Stage 1
November 22 – December 10, 2025 · Harry Tran
A PDF extractor that cannot separate a problem statement from its solution produces training records where the question field contains the answer. In Vietnamese math textbook PDFs, question/solution contamination is the default failure mode for naive extraction. Stage 1 tested whether a strict data contract could prevent it.
Problem statements and solutions share the same text block across most Vietnamese math textbook chapters. Solution steps appear inline with the question rather than in a separate, visually distinct block. Conventions differ by publisher and by grade level, and none holds consistently across a full textbook.
LaTeX notation compounds the problem. Mathematical expressions interrupt the natural reading order of Vietnamese text, which makes span-level heuristics that depend on linguistic continuity unreliable. A boundary-detection rule that correctly identifies the end of a question on one page may fail on the next page of the same book.
A training dataset where question fields contain partial or full solutions encodes a faulty input-output relationship: the model learns to regurgitate a solution when prompted with a question rather than to reason toward one. This contamination does not surface as a formatting error. Records that fail the question/solution boundary test still produce valid JSON and pass field-length checks. The error becomes visible only when training outcomes are evaluated, at which point the contaminated data has already been incorporated.
Extraction time is the only point in the pipeline where rejection is both cheap and reliable. Downstream correction requires locating contaminated records after the fact, which across a corpus of 50,000 records is not tractable without a separate annotation pass.
Before any extraction script ran, the team defined a strict JSON schema as a data contract across all parse runs. Every problem record was required to populate four fields at minimum: the question text, the answer, the grade level, and the topic category. Any record that failed to satisfy all four fields was dropped from the output.
A representative valid record conforms to the following structure:
{
    "question": "Tim nghiem cua phuong trinh ...",
    "answer": "x = 3",
    "grade": 10,
    "topic": "algebraic_equations"
}
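The contract can be sketched as a small validator. This is a minimal illustration, not the project's actual code: the field names come from the example record, while the type rules (string, string, integer, string) and the function names `is_valid_record` and `filter_valid` are assumptions made for the sketch.

```python
from typing import Any

# Field names follow the example record; the expected types are an
# assumption for illustration, not a documented part of the schema.
REQUIRED_FIELDS: dict[str, type] = {
    "question": str,
    "answer": str,
    "grade": int,
    "topic": str,
}


def is_valid_record(record: dict[str, Any]) -> bool:
    """Return True only if every required field is present, typed, and non-empty."""
    for name, expected_type in REQUIRED_FIELDS.items():
        value = record.get(name)
        # bool is a subclass of int, so reject it explicitly for "grade".
        if isinstance(value, bool) or not isinstance(value, expected_type):
            return False
        # Whitespace-only strings count as empty.
        if isinstance(value, str) and not value.strip():
            return False
    return True


def filter_valid(records: list[dict[str, Any]]) -> list[dict[str, Any]]:
    """Drop every record that fails the contract; no partial repair is attempted."""
    return [r for r in records if is_valid_record(r)]
```

Rejection is all-or-nothing in this sketch: a record with one bad field is dropped rather than repaired, which matches the binary validation described in this note.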
The schema had two roles. First, it made validation binary: a record either satisfied all four required fields or it did not. Second, because every machine that ran the extractor produced output conforming to the same field structure, merging became a mechanical step: concatenate, deduplicate by structural identity, validate against the schema. A looser contract would have introduced structural incompatibilities requiring manual reconciliation at the merge step.
The schema enforced output compatibility across all three machines before any runs were merged.
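The merge step (concatenate, deduplicate, validate) can be sketched as follows. The note does not define "structural identity", so this sketch assumes it means that two records agree on all four contract fields. The names `record_key`, `passes_contract`, and `merge_runs` are hypothetical.

```python
import json
from typing import Any, Iterable

REQUIRED = ("question", "answer", "grade", "topic")


def record_key(record: dict[str, Any]) -> str:
    """Canonical key: the four contract fields serialized in a fixed order."""
    return json.dumps([record.get(f) for f in REQUIRED], ensure_ascii=False)


def passes_contract(record: dict[str, Any]) -> bool:
    # Simplified check: every required field is present and not empty.
    return all(record.get(f) not in (None, "") for f in REQUIRED)


def merge_runs(runs: Iterable[list[dict[str, Any]]]) -> list[dict[str, Any]]:
    """Concatenate runs, drop exact duplicates, and keep only valid records."""
    seen: set[str] = set()
    merged: list[dict[str, Any]] = []
    for run in runs:
        for record in run:
            key = record_key(record)
            if key in seen or not passes_contract(record):
                continue
            seen.add(key)
            merged.append(record)
    return merged
```

Because every run shares the same field structure, the key function needs no per-machine special cases, which is the property the contract was meant to guarantee.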
Extraction scripts read raw textbook text and classified each contiguous span as belonging to one of the four required fields. The central challenge was identifying the boundary between a problem statement and its solution.
The approach was iterative: run the extractor, inspect failures against the schema, refine the boundary-detection heuristic, repeat. The schema's strict rejection rules were a correctness signal throughout this process. A record that passed validation was one where the classifier had separated question from solution cleanly enough to satisfy structural checks. Records that failed pointed directly at the cases where the heuristic broke.
Paid cloud compute was unavailable, so processing ran across three physical laptops, each authenticated to a separate free-tier cloud compute session. The total corpus was split into chunks calibrated to fit within the per-session memory limit. Each machine handled its chunk independently and outputs were merged after each successful run.
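One way to build such chunks is a greedy split under a size budget. The sketch below is an assumption about how the split could work, not a record of how it was done: it uses the UTF-8 size of the text as a rough proxy for memory use, and `chunk_by_budget`, `assign_round_robin`, and `max_bytes` are hypothetical names.

```python
def chunk_by_budget(pages: list[str], max_bytes: int) -> list[list[str]]:
    """Greedily group pages so each chunk's UTF-8 size stays within max_bytes.

    A single page larger than the budget still gets a chunk of its own.
    """
    chunks: list[list[str]] = []
    current: list[str] = []
    current_size = 0
    for page in pages:
        size = len(page.encode("utf-8"))
        # Start a new chunk when adding this page would exceed the budget.
        if current and current_size + size > max_bytes:
            chunks.append(current)
            current, current_size = [], 0
        current.append(page)
        current_size += size
    if current:
        chunks.append(current)
    return chunks


def assign_round_robin(chunks: list[list[str]], machines: int = 3) -> list[list[list[str]]]:
    """Deal chunks out to machines in turn; each machine gets a list of chunks."""
    return [chunks[i::machines] for i in range(machines)]
```

Text size underestimates real memory use during parsing, so in practice the budget would need to sit well below the session limit.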
Partial outputs were flushed to disk at regular checkpoints to limit data loss from mid-run crashes. Files recovered at crash points are partial captures of in-memory state and may be incomplete relative to the chunk they were processing.
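A checkpoint of this kind can be sketched with an append-only JSON Lines file. The file format and the names `append_checkpoint` and `load_checkpoint` are assumptions for illustration; the note does not say how checkpoints were stored.

```python
import json
import os
from typing import Any


def append_checkpoint(path: str, records: list[dict[str, Any]]) -> None:
    """Append a batch as JSON Lines and force it to disk before returning."""
    with open(path, "a", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record, ensure_ascii=False) + "\n")
        f.flush()
        os.fsync(f.fileno())


def load_checkpoint(path: str) -> list[dict[str, Any]]:
    """Read a checkpoint back, skipping any line that does not parse."""
    records: list[dict[str, Any]] = []
    # errors="replace" keeps a line cut mid-character from aborting the read.
    with open(path, encoding="utf-8", errors="replace") as f:
        for line in f:
            try:
                records.append(json.loads(line))
            except json.JSONDecodeError:
                # A half-written line is treated as lost, not repaired.
                continue
    return records
```

With one record per line, a crash can damage at most the line being written, so the recovered file is incomplete relative to its chunk but every line that parses is a whole record.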
The following illustrates the question/solution contamination failure mode that drove boundary-detection refinement throughout Stage 1.
A raw text span from a grade 10 algebra chapter, as extracted directly from the PDF:
Tim gia tri cua x biet: 2x + 5 = 11
Giai: 2x = 11 - 5 = 6, x = 3
Vay x = 3
Initial extraction, before boundary-detection refinement (incorrect):
{
    "question": "Tim gia tri cua x biet: 2x + 5 = 11 Giai: 2x = 11 - 5 = 6, x = 3 Vay x = 3",
    "answer": "",
    "grade": 10,
    "topic": "algebraic_equations"
}
This record fails the schema contract: the answer field is empty. The record is rejected. Without the strict rejection rule, this record would have entered the training set with the full solution embedded in the question field, encoding the wrong relationship between input and expected output.
After boundary-detection refinement, the same span produces:
{
    "question": "Tim gia tri cua x biet: 2x + 5 = 11",
    "answer": "x = 3",
    "grade": 10,
    "topic": "algebraic_equations"
}
This record passes validation. The schema rejection forced the extractor toward a classifier that could distinguish the question stem "Tim gia tri cua x biet" ("find the value of x given") from the solution marker "Giai" ("solution"). Making that distinction requires recognizing Vietnamese mathematical discourse markers; structural PDF layout features alone are insufficient.
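A marker-based split of the example span might look like the sketch below. It is not the Stage 1 classifier, whose rules are not given in this note. The two patterns cover only the "Giai:" and "Vay" markers seen in the example, and `split_question_answer` is a hypothetical name.

```python
import re
from typing import Optional

# Hypothetical patterns, limited to the two markers in the example span.
SOLUTION_MARKER = re.compile(r"^\s*Giai\s*:", re.IGNORECASE | re.MULTILINE)
CONCLUSION_MARKER = re.compile(r"^\s*Vay\s+(.+)$", re.IGNORECASE | re.MULTILINE)


def split_question_answer(span: str) -> Optional[tuple[str, str]]:
    """Return (question, answer), or None when no clean boundary is found."""
    marker = SOLUTION_MARKER.search(span)
    if marker is None:
        return None
    # Everything before the solution marker is the question, whitespace-normalized.
    question = " ".join(span[: marker.start()].split())
    solution = span[marker.end():]
    # The last "Vay ..." line in the solution is taken as the final answer.
    conclusions = CONCLUSION_MARKER.findall(solution)
    if not question or not conclusions:
        return None
    return question, conclusions[-1].strip().rstrip(".")


span = (
    "Tim gia tri cua x biet: 2x + 5 = 11\n"
    "Giai: 2x = 11 - 5 = 6, x = 3\n"
    "Vay x = 3"
)
print(split_question_answer(span))
# ('Tim gia tri cua x biet: 2x + 5 = 11', 'x = 3')
```

Returning `None` instead of a best guess keeps the behavior aligned with strict rejection: a span with no recognizable marker yields no record rather than a contaminated one.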
| Dates | Milestone | Details |
|---|---|---|
| Nov 22 – Nov 24 | Schema definition | Defined the four-field data contract. Established strict rejection rules for records failing field validation. |
| Nov 25 – Nov 27 | Parsing pipeline iteration | Built initial extraction scripts. Identified and partially resolved the question/solution contamination failure mode through iterative boundary-detection refinement. |
| Nov 28 | First successful large-scale merge | End-to-end run processed and merged multiple batch outputs without schema violations. The merged dataset contained more than 50,000 problem records. |
| Nov 29 – Dec 05 | Manual scaling via chunking | Expanded throughput across three cloud compute sessions. Implemented checkpoint saves to limit data loss from crashes. |
| Dec 10 | Stage conclusion | Both core questions answered. Stage closed after validating parse feasibility and rejecting free-tier compute as a scaling path. |
The November 28 run was the first end-to-end pass to produce a valid merged dataset without schema violations. The merged output contained more than 50,000 problem records drawn from multiple batch outputs across three machines. The merge succeeded because the schema contract held across all contributing runs: each machine had independently produced output conforming to the same field structure, which meant concatenation and deduplication could proceed without reconciliation.
Vietnamese math PDFs can be parsed into valid structured records in volume, given careful boundary-detection logic and strict field validation at extraction time. Going into Stage 1, the structural inconsistency of Vietnamese math textbooks was a known risk, and whether any extraction approach would yield a usable corpus was unresolved.
Making validation explicit at extraction time prevented incompatible outputs from accumulating across runs. Whether the 50,000+ records constitute a high-quality training corpus depends entirely on boundary-detection accuracy, which has not been independently audited. Structural validity (fields present, types correct) has been verified. Boundary accuracy has not.
Passing schema validation confirms that a record has non-empty fields in the correct positions. It does not confirm that the question/solution boundary was drawn at the correct location within the source text. These are different claims, and only the first is verified by the pipeline as described here.
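The gap between the two claims can be shown with a constructed record. The record below is hypothetical, built from the example span in this note rather than taken from the corpus, and `looks_contaminated` is a crude marker scan offered as an assumption, not as a substitute for an audit.

```python
import re
from typing import Any

REQUIRED = ("question", "answer", "grade", "topic")
# Hypothetical marker, based on the "Giai:" label in the example span.
SOLUTION_MARKER = re.compile(r"\bGiai\s*:", re.IGNORECASE)


def passes_schema(record: dict[str, Any]) -> bool:
    """Structural check only: every required field is present and non-empty."""
    return all(record.get(f) not in (None, "") for f in REQUIRED)


def looks_contaminated(record: dict[str, Any]) -> bool:
    """Flag question text that still contains a solution marker."""
    return bool(SOLUTION_MARKER.search(str(record.get("question", ""))))


# Constructed example: the boundary was drawn too late, yet all fields are filled.
record = {
    "question": "Tim gia tri cua x biet: 2x + 5 = 11 Giai: 2x = 11 - 5 = 6",
    "answer": "x = 3",
    "grade": 10,
    "topic": "algebraic_equations",
}
print(passes_schema(record), looks_contaminated(record))
# True True
```

A marker scan like this catches only contamination that keeps its label; solution text with no marker would pass both checks, which is why a manual audit is still needed.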
The recommended next step is to move PDF extraction to a controlled local environment with fixed memory limits and deterministic session behavior, then evaluate the boundary-detection heuristics against a manually labeled sample of at least 100 records.
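Such an evaluation could be sketched as below, assuming each labeled item pairs a raw span with a hand-written question and answer. The exact-match metric, the fixed seed, and the names `draw_sample` and `boundary_accuracy` are illustrative choices, not a plan the note commits to.

```python
import random
from typing import Callable, Optional

# (raw_span, gold_question, gold_answer), with the gold fields labeled by hand.
LabeledSpan = tuple[str, str, str]
Extractor = Callable[[str], Optional[tuple[str, str]]]


def draw_sample(items: list, k: int = 100, seed: int = 0) -> list:
    """Draw a reproducible random sample of up to k items for manual labeling."""
    rng = random.Random(seed)
    return rng.sample(items, min(k, len(items)))


def boundary_accuracy(labeled: list[LabeledSpan], extract: Extractor) -> float:
    """Fraction of spans where the extracted question and answer both match exactly."""
    if not labeled:
        return 0.0
    hits = 0
    for raw, gold_question, gold_answer in labeled:
        # A rejected span (None) counts as a miss.
        if extract(raw) == (gold_question, gold_answer):
            hits += 1
    return hits / len(labeled)
```

Exact match is strict: a boundary that is off by one token counts as a miss. A softer metric would need its own justification, and with a sample of around 100 records the resulting accuracy is a coarse estimate.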
Project stage: Infrastructure Experimentation & Data Structuring (AI Training v1). Converted from the project timeline log. Stage report authored: December 10, 2025.