Research Note

Schema Design and Boundary Detection in Vietnamese Math PDF Parsing

November 22 – December 10, 2025 · Harry Tran

A PDF extractor that cannot separate a problem statement from its solution produces training records where the question field contains the answer. In Vietnamese math textbook PDFs, question/solution contamination is the default failure mode for naive extraction. Stage 1 tested whether a strict data contract could prevent it.

Vietnamese Math Textbooks Do Not Mark Where Questions End

Problem statements and solutions share the same text block across most Vietnamese math textbook chapters. Solution steps appear inline with the question rather than in a separate, visually distinct block. Conventions differ by publisher and by grade level, and none of them holds consistently across a full textbook.

LaTeX notation compounds the problem. Mathematical expressions interrupt the natural reading order of Vietnamese text, which makes span-level heuristics that depend on linguistic continuity unreliable. A boundary-detection rule that correctly identifies the end of a question in one chapter may fail on the next page of the same book.

Contaminated Records Pass Validation and Fail Training

A training dataset where question fields contain partial or full solutions encodes a faulty input-output relationship: the model learns to regurgitate a solution when prompted with a question rather than to reason toward one. This contamination does not surface as a formatting error. Records that fail the question/solution boundary test still produce valid JSON and pass field-length checks. The error becomes visible only when training outcomes are evaluated, at which point the contaminated data has already been incorporated.

Extraction time is the only point in the pipeline where rejecting a contaminated record is both cheap and reliable. Downstream correction requires locating contaminated records after the fact, which across a corpus of more than 50,000 records is not tractable without a separate annotation pass.

A Four-Field Contract, Defined Before the First Script Ran

Before any extraction script ran, the team defined a strict JSON schema as a data contract across all parse runs. Every problem record was required to populate four fields at minimum: the question text, the answer, the grade level, and the topic category. Any record that failed to satisfy all four fields was dropped from the output.

A representative valid record conforms to the following structure:

{
  "question": "Tim nghiem cua phuong trinh ...",
  "answer": "x = 3",
  "grade": 10,
  "topic": "algebraic_equations"
}

The schema had two roles. First, it made validation binary: a record either satisfied all four required fields or it did not. Second, because all three machines used for processing (see Infrastructure) produced output conforming to the same field structure, merging became a mechanical step: concatenate, deduplicate on exact field content, validate against the schema. A looser contract would have introduced structural incompatibilities requiring manual reconciliation at the merge step.
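The binary check described above can be sketched in a few lines. This is a minimal illustration, not the project's actual validator: the function names and the exact type rules (strings for question, answer, and topic; an integer for grade) are assumptions inferred from the representative record.

```python
from typing import Any

# Assumed field types, inferred from the representative record in this note.
REQUIRED_FIELDS = {
    "question": str,
    "answer": str,
    "grade": int,
    "topic": str,
}


def validate_record(record: dict[str, Any]) -> list[str]:
    """Return a list of contract violations; an empty list means the record passes."""
    errors = []
    for name, expected_type in REQUIRED_FIELDS.items():
        if name not in record:
            errors.append(f"missing field: {name}")
            continue
        value = record[name]
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, expected_type):
            errors.append(f"wrong type for {name}: {type(value).__name__}")
        elif isinstance(value, str) and not value.strip():
            errors.append(f"empty field: {name}")
    return errors


def is_valid(record: dict[str, Any]) -> bool:
    """Binary outcome: the record either satisfies all four fields or is dropped."""
    return not validate_record(record)
```

Returning the list of violations rather than a bare boolean keeps the pass/fail decision binary while still showing which field caused a rejection, which is useful when rejected records are inspected to refine the extractor.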

Design principle

The schema enforced output compatibility across all three machines before any runs were merged.

Execution of the Extraction Pipeline

Extraction

Extraction scripts read raw textbook text and classified each contiguous span as belonging to one of the four required fields. The central challenge was identifying the boundary between a problem statement and its solution.

The approach was iterative: run the extractor, inspect failures against the schema, refine the boundary-detection heuristic, repeat. The schema's strict rejection rules were a correctness signal throughout this process. A record that passed validation was one where the classifier had separated question from solution cleanly enough to satisfy structural checks. Records that failed pointed directly at the cases where the heuristic broke.

Infrastructure

Paid cloud compute was unavailable, so processing ran across three physical laptops, each authenticated to a separate free-tier cloud compute session. The total corpus was split into chunks calibrated to fit within the per-session memory limit. Each machine handled its chunk independently and outputs were merged after each successful run.
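The note does not record how chunk sizes were calibrated. One plausible approach, sketched here under stated assumptions, is to pack source files greedily against a fixed size budget. The function name, the use of a per-file size estimate as a proxy for memory use, and the budget value are all hypothetical.

```python
def plan_chunks(sizes: dict[str, int], budget: int) -> list[list[str]]:
    """Greedily pack items into chunks whose total estimated size stays within budget.

    `sizes` maps an item name (for example a PDF file) to an estimated cost;
    `budget` is the assumed per-session limit in the same unit.
    """
    chunks: list[list[str]] = []
    current: list[str] = []
    used = 0
    for name, size in sizes.items():
        if size > budget:
            raise ValueError(f"{name} alone exceeds the budget; split it further")
        # Close the current chunk when the next item would overflow it.
        if current and used + size > budget:
            chunks.append(current)
            current, used = [], 0
        current.append(name)
        used += size
    if current:
        chunks.append(current)
    return chunks
```

On-disk size is only a rough proxy for peak memory. Pages with embedded diagrams may cost far more to process than their file size suggests, so a budget set this way would likely need a safety margin.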

Partial outputs were flushed to disk at regular checkpoints to limit data loss from mid-run crashes. Files recovered at crash points are partial captures of in-memory state and may be incomplete relative to the chunk they were processing.
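A checkpointing scheme of this kind can be sketched as follows. The JSON Lines file format, the function names, and the checkpoint interval are assumptions made for illustration; the note does not specify how the actual flushes were implemented.

```python
import json
import os


def write_with_checkpoints(records, path, every=100):
    """Append records as JSON Lines, forcing them to disk every `every` records."""
    with open(path, "a", encoding="utf-8") as fh:
        for count, record in enumerate(records, start=1):
            fh.write(json.dumps(record, ensure_ascii=False) + "\n")
            if count % every == 0:
                fh.flush()             # push Python's buffer to the OS
                os.fsync(fh.fileno())  # ask the OS to write it to disk


def read_recovered(path):
    """Load a possibly truncated JSON Lines file.

    Returns (records, skipped), where `skipped` counts lines that did not
    parse, such as a final line cut off by a crash.
    """
    records, skipped = [], 0
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            if not line.strip():
                continue
            try:
                records.append(json.loads(line))
            except json.JSONDecodeError:
                skipped += 1
    return records, skipped
```

One record per line means a crash can damage at most the line being written, so a recovered file stays readable up to the last complete record. It still says nothing about how much of the chunk remained unprocessed.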

A Grade 10 Problem That Would Have Broken Training

The following illustrates the question/solution contamination failure mode that drove boundary-detection refinement throughout Stage 1.

A raw text span from a grade 10 algebra chapter, as extracted directly from the PDF:

Tim gia tri cua x biet: 2x + 5 = 11
Giai: 2x = 11 - 5 = 6, x = 3
Vay x = 3

Initial extraction, before boundary-detection refinement (incorrect):

{
  "question": "Tim gia tri cua x biet: 2x + 5 = 11 Giai: 2x = 11 - 5 = 6, x = 3 Vay x = 3",
  "answer": "",
  "grade": 10,
  "topic": "algebraic_equations"
}

This record fails the schema contract: the answer field is empty. The record is rejected. Without the strict rejection rule, this record would have entered the training set with the full solution embedded in the question field, encoding the wrong relationship between input and expected output.

After boundary-detection refinement, the same span produces:

{
  "question": "Tim gia tri cua x biet: 2x + 5 = 11",
  "answer": "x = 3",
  "grade": 10,
  "topic": "algebraic_equations"
}

This record passes validation. The schema rejection forced the extractor toward a classifier that could distinguish "tim gia tri cua x biet" (find the value of x given) from "giai" (solution). The distinction requires recognizing Vietnamese mathematical discourse markers; structural PDF layout features are insufficient for this classification.
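A marker-based split of the span above can be sketched as follows. This is an illustration only, not the project's classifier. The marker lists, the function name, and the rule that a solution marker must be followed by a colon or stand alone on its line are assumptions. The markers are ASCII-folded to match the sample text; real textbook text carries diacritics (for example "Giải", "Vậy") and would need matching patterns.

```python
import re

# Hypothetical, ASCII-folded marker lists. A solution marker must be followed
# by a colon or end the line, so that a question such as "Giai phuong trinh ..."
# (solve the equation ...) is not mistaken for the start of a solution.
SOLUTION_RE = re.compile(
    r"^[ \t]*(?:huong dan giai|loi giai|giai)[ \t]*(?::|$)[ \t]*",
    re.IGNORECASE | re.MULTILINE,
)
CONCLUSION_RE = re.compile(
    r"^[ \t]*(?:vay|dap so)\b[ \t]*:?[ \t]*",
    re.IGNORECASE | re.MULTILINE,
)


def split_span(span: str) -> tuple[str, str]:
    """Split a raw span into (question, answer).

    The answer is "" when no solution marker or no conclusion marker is found,
    which lets a strict schema check reject the record.
    """
    boundary = SOLUTION_RE.search(span)
    if boundary is None:
        return " ".join(span.split()), ""
    question = " ".join(span[: boundary.start()].split())
    solution = span[boundary.end():]
    conclusions = list(CONCLUSION_RE.finditer(solution))
    if not conclusions:
        return question, ""
    # Take the text after the last conclusion marker as the final answer.
    answer = " ".join(solution[conclusions[-1].end():].split())
    return question, answer


raw = (
    "Tim gia tri cua x biet: 2x + 5 = 11\n"
    "Giai: 2x = 11 - 5 = 6, x = 3\n"
    "Vay x = 3"
)
print(split_span(raw))
```

Fixed marker lists like these are brittle: a chapter that introduces solutions with a different phrase, or omits the colon, would fall through to the empty-answer path. That matches the chapter-specific refinement this note describes.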

What the Pipeline Produced

Dates	Milestone	Details
Nov 22 – Nov 24	Schema definition	Defined the four-field data contract. Established strict rejection rules for records failing field validation.
Nov 25 – Nov 27	Parsing pipeline iteration	Built initial extraction scripts. Identified and partially resolved the question/solution contamination failure mode through iterative boundary-detection refinement.
Nov 28	First successful large-scale merge	End-to-end run processed and merged multiple batch outputs without schema violations. The merged dataset contained more than 50,000 problem records.
Nov 29 – Dec 05	Manual scaling via chunking	Expanded throughput across three cloud compute sessions. Implemented checkpoint saves to limit data loss from crashes.
Dec 10	Stage conclusion	Both core questions answered. Stage closed after validating parse feasibility and rejecting free-tier compute as a scaling path.

The November 28 run was the first end-to-end pass to produce a valid merged dataset without schema violations. The merged output contained more than 50,000 problem records drawn from multiple batch outputs across three machines. The merge succeeded because the schema contract held across all contributing runs: each machine had independently produced output conforming to the same field structure, which meant concatenation and deduplication could proceed without reconciliation.
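The merge step described above (concatenate, deduplicate, validate) can be sketched as follows. The function names are hypothetical, and the exact-match key reflects the deduplication behaviour noted under the stage's limitations rather than a confirmed implementation.

```python
import json

REQUIRED = ("question", "answer", "grade", "topic")


def record_key(record):
    """Exact-match key: the four contract fields, serialized deterministically."""
    return json.dumps([record.get(name) for name in REQUIRED], ensure_ascii=False)


def merge_batches(batches):
    """Concatenate batches, drop contract violations, and drop exact duplicates."""
    seen = set()
    merged = []
    for batch in batches:
        for record in batch:
            # A missing or empty required field is a contract violation.
            if any(record.get(name) in (None, "") for name in REQUIRED):
                continue
            key = record_key(record)
            if key in seen:
                continue
            seen.add(key)
            merged.append(record)
    return merged
```

Because the key is built from exact field content, two records that differ only by a trailing space or a punctuation mark are kept as separate entries.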

What the November 28 Merge Established, and What It Did Not

Vietnamese math PDFs can be parsed into valid structured records in volume, given careful boundary-detection logic and strict field validation at extraction time. Going into Stage 1, the structural inconsistency of Vietnamese math textbooks was a known risk, and whether any extraction approach would yield a usable corpus was unresolved.

Making validation explicit at extraction time prevented incompatible outputs from accumulating across runs. Whether the 50,000+ records constitute a high-quality training corpus depends entirely on boundary-detection accuracy, which has not been independently audited. Structural validity (fields present, types correct) has been verified. Boundary accuracy has not.

Interpretation note

Passing schema validation confirms that a record has non-empty fields in the correct positions. It does not confirm that the question/solution boundary was drawn at the correct location within the source text. These are different claims, and only the first is verified by the pipeline as described here.

Three Points Where the Approach Broke Down

Free-tier compute cannot support this pipeline at scale. Geometry-heavy chapters and pages with embedded diagrams were the primary out-of-memory triggers, because diagram extraction combined with the surrounding math notation exceeded the per-session memory budget. The supervision overhead (monitoring session state, restarting on crash, rechunking after memory failures) is not operationally sustainable for a corpus of this size.
Naive extraction is fragile across Vietnamese math PDFs. Layout varies across books and chapters, so a heuristic that worked on one chapter could fail on the next. The question/solution boundary was the most consistent failure point: Vietnamese math notation and inline solution steps make span-level classification unreliable without iterative, chapter-specific refinement.
Session interruptions leave coverage unknown. Files recovered from crash points are partial runs of unknown completeness. The fraction of each chunk processed before a session failure was not systematically recorded, which makes coverage estimates for the full corpus unreliable.

What This Stage Does Not Prove

The merged dataset of more than 50,000 records has not been audited for question/solution boundary accuracy. Schema validation confirms field presence, not boundary correctness. A labeled sample review was not completed as part of Stage 1.
Boundary-detection heuristics were refined against a subset of the corpus drawn from one publisher's catalogue. Chapters from other publishers or grade bands may use different inline solution conventions and are likely to produce different failure rates under the same extraction logic.
The three-machine processing architecture means chunk coverage is not deterministic. The total fraction of the source corpus that was successfully processed and passed into the merged dataset is not precisely known.
The deduplication step matches on exact field content. Near-duplicate problems with minor textual differences may appear as distinct records in the merged output.

Before Any Training Work Begins

Move PDF extraction to a controlled local environment with fixed memory limits and deterministic session behavior, then evaluate the boundary-detection heuristics against a manually labeled sample of at least 100 records.
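Drawing the labeled sample and scoring it can be sketched as follows. The function names, the fixed seed, and the representation of reviewer judgements as a list of booleans are assumptions; the note only specifies a manually labeled sample of at least 100 records.

```python
import random


def draw_audit_sample(records, size=100, seed=0):
    """Draw a reproducible simple random sample of records for manual labeling."""
    if len(records) <= size:
        return list(records)
    # A dedicated Random instance keeps the draw independent of global state.
    return random.Random(seed).sample(records, size)


def boundary_accuracy(labels):
    """Fraction of audited records whose boundary a reviewer marked correct.

    `labels` holds one boolean per audited record (True = boundary correct).
    """
    if not labels:
        raise ValueError("no labels to score")
    return sum(1 for ok in labels if ok) / len(labels)
```

With 100 records the estimate is coarse: at an observed accuracy of 0.9, the standard error is about 0.03. A uniform sample may also under-represent publishers or grade bands that make up a small share of the corpus, so sampling per publisher may be worth considering.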

Project stage: Infrastructure Experimentation & Data Structuring (AI Training v1). Converted from the project timeline log. Stage report authored: December 10, 2025.