Engineering Note: Cadmus Pipeline, Stage 3

First End-to-End Pipeline Prototype

A sequential pipeline that successfully converts a PDF into a JSON record does not guarantee the record contains valid mathematics. Code-Flow-v1 demonstrated this directly: the pipeline completed batch runs without errors while producing records where a sub-question was separated from its parent stem at a sliding-window boundary, or where an answer referenced a diagram that had been stripped during extraction. Format validation passed. The pipeline had no mechanism to check content integrity.

Three Silent Failure Modes in the Source Data

Structuring Vietnamese high school math content for model training requires solving three encoding and formatting problems in sequence, each of which can corrupt the output of the next step without producing a visible error.

Vietnamese character encoding inconsistency. Consumer-grade PDF scanning software frequently encodes Vietnamese diacritics as base characters paired with combining Unicode modifiers rather than as precomposed codepoints. The character "a" + U+0306 renders the same as the precomposed U+0103 ("ă") in a PDF viewer. String comparison, tokenization, and embedding lookup treat them as different sequences. This inconsistency appears across the source corpus and varies by scanning software version.
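
The mismatch described above can be reproduced with Python's standard `unicodedata` module. This is a minimal sketch; the helper name `to_nfc` is illustrative and is not taken from the Code-Flow-v1 codebase.

```python
import unicodedata

decomposed = "a\u0306"    # "a" followed by COMBINING BREVE
precomposed = "\u0103"    # LATIN SMALL LETTER A WITH BREVE

# The two strings render identically but compare as different sequences.
same_before = decomposed == precomposed


def to_nfc(text: str) -> str:
    """Fold base-plus-combining sequences into precomposed codepoints."""
    return unicodedata.normalize("NFC", text)


same_after = to_nfc(decomposed) == to_nfc(precomposed)
print(same_before, same_after)  # False True
```

Normalizing both sides to NFC before hashing, comparing, or tokenizing removes the difference regardless of which scanning software produced the text.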

LaTeX and JSON serialization conflict. Vietnamese math textbooks embed mathematical notation in LaTeX syntax. JSON and LaTeX use the backslash for incompatible purposes: JSON treats it as an escape prefix, while LaTeX uses it as the command prefix for every function (\frac, \sqrt, \sum). An extraction model that outputs valid-looking JSON with embedded LaTeX will frequently produce strings that break deserialization because the backslashes are not double-escaped. Some sequences do not fail at all: the \f in \frac is a legal JSON escape and is read as a form feed, so the string deserializes into the wrong characters. A repair heuristic can patch the most common cases, but incorrectly patched records enter the corpus with corrupted mathematical expressions.
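
The backslash conflict is easy to demonstrate with the standard `json` module. The sketch below uses made-up field contents; it shows one command that fails to parse and one that parses into the wrong characters.

```python
import json

# Model output as raw text: single backslashes inside a JSON string.
raw_frac = '{"question": "\\frac{1}{2}"}'   # the text contains \frac
raw_sqrt = '{"question": "\\sqrt{2}"}'      # the text contains \sqrt

# \f is a legal JSON escape (form feed), so this parses but corrupts the LaTeX.
corrupted = json.loads(raw_frac)["question"]

# \s is not a legal JSON escape, so this fails to deserialize.
try:
    json.loads(raw_sqrt)
    sqrt_failed = False
except json.JSONDecodeError:
    sqrt_failed = True

# Serializing with json.dumps escapes every backslash and round-trips cleanly.
record = {"question": "\\frac{1}{2} + \\sqrt{2}"}
round_tripped = json.loads(json.dumps(record))

print(repr(corrupted), sqrt_failed, round_tripped == record)
```

The silent case is the more dangerous of the two: the record stays in the batch with a control character where the command name used to be.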

Multi-part problem structure. Vietnamese math problems routinely consist of a single stem followed by sub-questions labeled a), b), c). A text extraction system operating on fixed-size segments will sometimes place the stem and its sub-questions in different segments. The resulting record contains a sub-question as its question field, with no stem context, and may appear structurally valid because all required schema fields are present.

Why These Failures Go Undetected

None of these failures produce a pipeline error. A Unicode normalization inconsistency passes silently into the training corpus and degrades tokenization and answer-matching accuracy downstream. A LaTeX deserialization failure causes a record to be dropped, or - if the repair heuristic applies incorrectly - kept with corrupted mathematical content. A sub-question separated from its stem passes the format schema check because the required fields are populated.

A training corpus with these defects does not fail at load time. The defects surface later, when a fine-tuned model generates answers that reference missing context, produces tokenization artifacts from inconsistent diacritics, or learns from malformed mathematical expressions. Detecting these failures required explicit logic targeting each one. Building and tuning that logic occupied more development time than the extraction and structuring phases combined.

How Code-Flow-v1 Was Assembled

Code-Flow-v1 connects three stages in sequence, with no persistent intermediate state between them:

PDF → Markdown → Cleaned Records → Structured Training Examples

Stage 1: OCR extraction. A batch processing wrapper converts PDF pages to Markdown using a Vietnamese-capable OCR engine in forced mode, bypassing the PDF text layer entirely and treating each page as a raster image.

Stage 2: Windowed extraction. A sliding-window component divides the Markdown output into overlapping segments and submits each segment to the Gemini API, requesting structured records conforming to this schema:

{
  "question": "string",
  "answer":   "string",
  "topic":    "string",
  "grade":    "integer (9, 10, 11, or 12)",
  "type":     "'mcq' | 'open'"
}

Stage 3: Validation. A two-pass validation component handles output quality. A deterministic structural cleaner runs first, followed by an LLM-based content filter.
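
A format check against the schema above might look like the following sketch. The function name, the non-empty-string rule, and the example topic value are assumptions made for illustration; the note does not describe how Code-Flow-v1 implemented its format validation.

```python
ALLOWED_GRADES = {9, 10, 11, 12}
ALLOWED_TYPES = {"mcq", "open"}


def schema_errors(record: dict) -> list[str]:
    """Return a list of schema violations; an empty list means the record passes."""
    errors = []
    for field in ("question", "answer", "topic"):
        value = record.get(field)
        if not isinstance(value, str) or not value.strip():
            errors.append(f"{field}: expected a non-empty string")
    grade = record.get("grade")
    # bool is a subclass of int in Python, so it is excluded explicitly.
    if not isinstance(grade, int) or isinstance(grade, bool) or grade not in ALLOWED_GRADES:
        errors.append("grade: expected 9, 10, 11, or 12")
    kind = record.get("type")
    if not isinstance(kind, str) or kind not in ALLOWED_TYPES:
        errors.append("type: expected 'mcq' or 'open'")
    return errors


# A sub-question with no stem still satisfies every field constraint.
orphan = {
    "question": "a) f(x) > 0",
    "answer": "x < 1 hoặc x > 3/2",
    "topic": "quadratic inequalities",  # hypothetical topic label
    "grade": 10,
    "type": "open",
}
print(schema_errors(orphan))  # []
```

The empty result for the orphaned sub-question is the point: a check of this kind confirms field presence and type, and says nothing about whether the question can be solved from the record alone.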

How Each Stage Was Configured

OCR Configuration

Forced OCR mode was required for two reasons. First, Vietnamese uses 134 distinct characters when diacritics and tone marks are counted - far more than most other Latin-script languages. Second, PDF text layers produced by consumer-grade scanning software frequently encode these characters as base characters with combining Unicode modifiers. The OCR engine, applied to the raster page image, produces consistent precomposed Unicode output regardless of how the source PDF was internally encoded.

Mathematical diagrams - primarily geometry constructions - were excluded from the extraction output. Generating text descriptions of geometry figures would introduce unverifiable content into the training corpus. Diagram-dependent problems were identified in a later routing step and excluded from the corpus.

Sliding Window Parameters

Each window covered approximately one page of text, sized at 500-800 tokens depending on mathematical density. Adjacent windows shared a forward overlap of two to three paragraphs. The overlap was necessary to prevent problems that span a page boundary from being truncated at the window edge - a failure that produces records missing either the question stem or the answer, with no downstream fix available.

The overlap introduces a cost: on a 200-page document with 20% overlap, approximately 1.2x the base token count is submitted across all windows.
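
The cost figure can be checked with a small model of the windowing. The sketch below assumes pages are already split into paragraphs, that each window is one page plus the first paragraphs of the next page, and that whitespace-separated words stand in for tokens; none of these details are confirmed by the prototype's code.

```python
def build_windows(pages: list[list[str]], overlap_paragraphs: int = 2) -> list[list[str]]:
    """Build one window per page, extended forward into the next page."""
    windows = []
    for i, page in enumerate(pages):
        tail = pages[i + 1][:overlap_paragraphs] if i + 1 < len(pages) else []
        windows.append(page + tail)
    return windows


def token_multiplier(pages: list[list[str]], windows: list[list[str]]) -> float:
    """Ratio of tokens submitted across all windows to tokens in the source."""
    def count(paragraph: str) -> int:
        return len(paragraph.split())  # crude stand-in for a real tokenizer

    base = sum(count(p) for page in pages for p in page)
    submitted = sum(count(p) for window in windows for p in window)
    return submitted / base


# 200 pages, 10 paragraphs of 50 words each, 2 paragraphs (20%) of overlap.
pages = [["w " * 50 for _ in range(10)] for _ in range(200)]
windows = build_windows(pages, overlap_paragraphs=2)
print(round(token_multiplier(pages, windows), 3))  # 1.199
```

The result falls just short of 1.2 because the last page has no successor to overlap with. Note that in this model the overlap only extends a window forward: a window never includes paragraphs from the page before it.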

Structural Cleaning

The deterministic cleaner ran first. It applied deduplication by question hash, normalized Unicode inconsistencies including variation between composed and decomposed Vietnamese diacritics that had survived the OCR step, and applied a geometry routing pass. The geometry router used a keyword heuristic: a record was flagged as likely diagram-dependent when its question contained Vietnamese geometry terms ("hình", "tam giác", "chứng minh rằng"; roughly "figure", "triangle", "prove that") and its answer was not purely numerical or algebraic. These records were written to a separate exclusion log rather than discarded.
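
The three cleaning steps can be sketched as follows. The keyword list comes from the paragraph above; the regular expression used to decide whether an answer is "purely numerical or algebraic", the function names, and the sample records are assumptions made for illustration.

```python
import hashlib
import re
import unicodedata

GEOMETRY_TERMS = tuple(
    unicodedata.normalize("NFC", term)
    for term in ("hình", "tam giác", "chứng minh rằng")
)
# Hypothetical test: the answer uses only digits, ASCII letters, and math symbols.
ALGEBRAIC_ANSWER = re.compile(r"^[0-9A-Za-z\s=<>≤≥+\-*/^().,;√²³π]+$")


def question_hash(question: str) -> str:
    """Hash the question after NFC normalization and whitespace collapsing."""
    normalized = unicodedata.normalize("NFC", question).casefold()
    normalized = " ".join(normalized.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def is_geometry_dependent(record: dict) -> bool:
    question = unicodedata.normalize("NFC", record["question"]).casefold()
    has_term = any(term in question for term in GEOMETRY_TERMS)
    algebraic = ALGEBRAIC_ANSWER.match(record["answer"].strip()) is not None
    return has_term and not algebraic


def clean(records: list[dict]) -> tuple[list[dict], list[dict]]:
    """Return (kept, excluded); duplicates by question hash are dropped."""
    kept, excluded, seen = [], [], set()
    for record in records:
        key = question_hash(record["question"])
        if key in seen:
            continue
        seen.add(key)
        (excluded if is_geometry_dependent(record) else kept).append(record)
    return kept, excluded


algebra = {"question": "Giải phương trình x² − 5x + 6 = 0", "answer": "x = 2; x = 3"}
duplicate = {**algebra, "question": unicodedata.normalize("NFD", algebra["question"])}
geometry = {
    "question": "Cho tam giác ABC vuông tại A. Chứng minh rằng BC² = AB² + AC².",
    "answer": "Áp dụng định lý Pythagore cho tam giác ABC.",
}
kept, excluded = clean([algebra, duplicate, geometry])
print(len(kept), len(excluded))  # 1 1
```

Substring matching on a short term like "hình" will also match unrelated words that contain it, which is one way a heuristic of this kind produces false positives.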

Content Filtering

The LLM-based content filter applied a Judge prompt to each remaining record, evaluating three criteria in sequence:

  1. Mathematical completeness. Does the question contain sufficient context to be solved from text alone, without referencing surrounding document material?
  2. Language validity. Are the question and answer in Vietnamese? Mixed-language or transliterated records were flagged.
  3. Structural integrity. Are sub-questions (labeled a), b), c)) accompanied by their parent question stem?

Records failing any criterion were excluded. The structural cleaner ran before the Judge filter because the filter produced inconsistent results on records with Unicode normalization issues - the deterministic pass was a prerequisite for reliable LLM-based evaluation.

Concrete Example: Sub-Question Separation at a Window Boundary

The following example is representative of the structural integrity failure mode. The format and difficulty match Grade 10 algebra problems in Vietnamese textbooks; the numbers are illustrative.

Suppose a source document contains a problem that straddles a page boundary. Window A ends with the stem:

Cho tam thức bậc hai f(x) = 2x² − 5x + 3. Tìm các giá trị của x sao cho:

Translation: "Given the quadratic f(x) = 2x² − 5x + 3, find the values of x such that:"

Window B begins with the sub-questions, without the stem in its context:

a) f(x) > 0
b) f(x) ≤ 0
c) f(x) = 0

The extraction component, processing Window B, generates a record with "question": "a) f(x) > 0" and an answer solving the inequality in isolation. The answer is mathematically correct. All required schema fields are present. This record passes the format check and reaches the Judge filter.

The structural integrity criterion identifies the sub-question pattern - the "a)" label without a preceding context sentence - and excludes the record. Without this filter, the record would have entered the training corpus. A model trained on it would encounter "a) f(x) > 0" as a standalone problem, a form that does not appear in textbooks and provides no useful training signal.
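
The label pattern in this example is regular enough that a deterministic pre-check could catch the simplest cases before an LLM call. The sketch below is an assumption-laden illustration, not the Judge prompt's logic: the regular expression and function name are hypothetical, and only lowercase labels a) through h) are recognized.

```python
import re

# A question that begins with a sub-question label such as "a)" or "(b)".
SUBQUESTION_LABEL = re.compile(r"^\s*\(?[a-h]\)\s*\S")


def is_orphaned_subquestion(question: str) -> bool:
    """True when the question text starts with a label and has no stem before it."""
    return SUBQUESTION_LABEL.match(question) is not None


stem = "Cho tam thức bậc hai f(x) = 2x² − 5x + 3. Tìm các giá trị của x sao cho:"
print(is_orphaned_subquestion("a) f(x) > 0"))             # True
print(is_orphaned_subquestion(stem + " a) f(x) > 0"))     # False
```

A check like this only sees the start of the string. It would miss an orphan whose label was stripped during extraction, which is the kind of borderline case that still needs content-level evaluation.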

The forward overlap between Window A and Window B resolves this failure when the stem ends well before the window boundary. In this case the stem occupied only the final paragraph of Window A. The overlap carried the sub-questions into Window A's context but did not carry the stem into Window B's context, so Window B still emitted the sub-questions as standalone records. The directional asymmetry of the overlap is the root cause: only forward overlap was implemented.

What the Pipeline Produced

The pipeline produced 4,192 clean records across the corpus, processed without manual handoffs between stages. These records passed all three Judge criteria and the structural cleaner filters.

Per-stage failure counts were not tracked in Code-Flow-v1, and the total exclusion rate - records flagged at any stage relative to total extracted records - was not measured. The 4,192 figure counts only records that passed all filters. Recall against the source corpus was not measured.

Completion Is Not the Same as Recall

The net output count confirms that sequential automation works end-to-end: a directory of Vietnamese math PDFs can be processed into training records without manual intervention at any step.

Recall is a separate question. Without per-stage failure counts, it cannot be estimated. The geometry router, the LaTeX repair step, and the Judge filter each exclude records independently. If any component has a high false-positive rate, a meaningful fraction of valid problems may have been excluded without detection. A pipeline that excludes aggressively produces fewer clean records than one with high recall, and both produce the same completion signal.

Four Constraints That Cannot Be Tuned Away

Four failure modes were identified as structural constraints rather than parameter choices. They cannot be addressed by adjusting window size, overlap fraction, or prompt wording.

| Failure mode | Mechanism | Effect |
| --- | --- | --- |
| Overlapping window cost | 20% overlap across all windows multiplies token consumption by approximately 1.2x the base count. Cost scales linearly with corpus size. | API cost per document is substantially higher than a single-pass approach. Not sustainable at full corpus scale. |
| No intermediate checkpointing | Entire document state is passed as string payloads between pipeline stages. No intermediate file format or database stores stage outputs. | A failure at the validation step, after successful OCR and extraction, requires reprocessing from the OCR step. On long documents, failures near the end lose the most upstream work. |
| No partial recovery | No resume logic between pipeline stages. A document either completes fully or restarts from the beginning. | No mechanism to retry only the failed segment. Reprocessing cost is proportional to document length regardless of where the failure occurred. |
| Hallucination in validated output | The Judge filter checks structural completeness and language validity, not mathematical correctness. | A record containing a plausible but incorrect solution passes all filter criteria. Mathematical validity requires a verification layer that was not implemented in this stage. |

Four structural failure modes identified during Stage 3 evaluation. All four require architectural changes. Adjusting parameters cannot fix them.

Design note

The two-pass validation design - one pass for structural repair, one for content evaluation - was introduced to isolate failure modes during debugging. As each pass accumulated edge-case logic, the passes became difficult to test independently. The LLM-based filter in particular showed inconsistent behavior on borderline records as its evaluation criteria list grew. Each validation criterion should be a targeted, independently testable component. Accumulating all criteria in one LLM prompt makes behavior unpredictable as the list grows.
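
One way to keep criteria independently testable is to register each as a named predicate and tally failures per name. The sketch below is hypothetical: the two example checks are simple placeholders, not the Judge criteria, and the function names are not from the prototype.

```python
from collections import Counter
from typing import Callable

Check = Callable[[dict], bool]  # returns True when the record passes


def run_checks(record: dict, checks: dict[str, Check]) -> list[str]:
    """Return the name of every check the record fails."""
    return [name for name, check in checks.items() if not check(record)]


def filter_records(records: list[dict], checks: dict[str, Check]):
    """Keep records that pass all checks and count failures per check name."""
    kept, failures = [], Counter()
    for record in records:
        failed = run_checks(record, checks)
        failures.update(failed)
        if not failed:
            kept.append(record)
    return kept, failures


# Placeholder checks for illustration only.
checks = {
    "has_answer": lambda r: bool(r["answer"].strip()),
    "has_stem": lambda r: not r["question"].lstrip().startswith(("a)", "b)", "c)")),
}
records = [
    {"question": "Giải phương trình 2x + 1 = 5", "answer": "x = 2"},
    {"question": "a) f(x) > 0", "answer": "x < 1 hoặc x > 3/2"},
    {"question": "b) f(x) ≤ 0", "answer": ""},
]
kept, failures = filter_records(records, checks)
print(len(kept), dict(failures))  # 1 {'has_stem': 2, 'has_answer': 1}
```

Because each predicate is a plain function, it can be unit-tested on its own, and the per-name tally gives the failure distribution that a single combined prompt cannot provide.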

What the Prototype Cannot Tell Us

The pipeline has no per-stage metrics. Exclusion counts, failure type distributions, and per-document error rates were not logged in Code-Flow-v1. This makes it impossible to estimate the precision or recall of the extraction step independent of the filtering steps. The 4,192 clean record count measures what the pipeline kept. There was no labeled ground truth to measure quality against.

The geometry router keyword heuristic was calibrated against a limited review of excluded records. It may produce false positives on algebraic word problems that use Vietnamese geometry-adjacent terms - "tam giác" appears in some combinatorics problems as well as geometry. The false-positive rate on the full corpus was not measured.

The Judge filter's performance on ambiguous records - borderline sub-question separation, mixed-language notation, unusual problem formats - was not evaluated against a labeled set. Its exclusion decisions on edge cases cannot be verified without ground truth annotations.

The LaTeX repair heuristic was applied as a pre-deserialization fix. Records that the heuristic patched incorrectly may contain corrupted mathematical expressions that passed all subsequent filters. These are indistinguishable from valid records in the output.
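
The note does not specify the repair heuristic, so the sketch below uses a hypothetical one: double any backslash that does not begin a legal JSON escape. It illustrates how such a rule can fix one command and silently corrupt another in the same string.

```python
import json
import re

# Either an already-escaped pair (left alone) or a lone backslash that does
# not start a legal JSON escape (doubled). Hypothetical heuristic.
BACKSLASH = re.compile(r'\\\\|\\(?!["/bfnrtu])')


def repair(raw: str) -> str:
    def fix(match: re.Match) -> str:
        token = match.group(0)
        return token if len(token) == 2 else "\\\\"
    return BACKSLASH.sub(fix, raw)


raw = '{"answer": "\\sqrt{2} + \\frac{1}{2}"}'   # the text contains \sqrt and \frac
patched = json.loads(repair(raw))["answer"]
print(repr(patched))  # '\\sqrt{2} + \x0crac{1}{2}'
```

The repaired text parses, \sqrt survives, and \frac becomes a form feed followed by "rac". The same happens to any command whose first letter is b, f, n, r, or t (for example \beta, \nu, \rho, \theta), and nothing downstream marks the record as damaged.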

Next Step

Introduce persistent intermediate storage between pipeline stages: each extracted record written to a database row after the extraction step, before the validation step runs. This decouples the stages so the filtering and validation logic can be retried, modified, and re-applied to existing extracted records without reprocessing the source PDFs. Per-stage failure logging follows as a side effect of the record lifecycle.
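
A minimal sketch of that storage layer, using the standard `sqlite3` module, is shown below. The table layout, column names, status values, and function names are all assumptions made for illustration; the note does not specify which database or schema the later pipeline used.

```python
import json
import sqlite3


def open_store(path: str = ":memory:") -> sqlite3.Connection:
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS records (
               id INTEGER PRIMARY KEY,
               source_pdf TEXT NOT NULL,
               window_index INTEGER NOT NULL,
               payload TEXT NOT NULL,
               status TEXT NOT NULL DEFAULT 'extracted',
               failed_check TEXT
           )"""
    )
    return conn


def save_extracted(conn, source_pdf: str, window_index: int, record: dict) -> int:
    """Persist one extracted record before any validation runs."""
    cur = conn.execute(
        "INSERT INTO records (source_pdf, window_index, payload) VALUES (?, ?, ?)",
        (source_pdf, window_index, json.dumps(record, ensure_ascii=False)),
    )
    conn.commit()
    return cur.lastrowid


def validate_pending(conn, checks: dict) -> None:
    """Apply named checks to rows still marked 'extracted' and record the outcome."""
    rows = conn.execute(
        "SELECT id, payload FROM records WHERE status = 'extracted'"
    ).fetchall()
    for row_id, payload in rows:
        record = json.loads(payload)
        failed = next((name for name, check in checks.items() if not check(record)), None)
        conn.execute(
            "UPDATE records SET status = ?, failed_check = ? WHERE id = ?",
            ("rejected" if failed else "clean", failed, row_id),
        )
    conn.commit()


def failure_counts(conn) -> dict:
    rows = conn.execute(
        "SELECT failed_check, COUNT(*) FROM records "
        "WHERE status = 'rejected' GROUP BY failed_check"
    ).fetchall()
    return dict(rows)


conn = open_store()
save_extracted(conn, "book.pdf", 0, {"question": "a) f(x) > 0", "answer": "x < 1"})
save_extracted(conn, "book.pdf", 1, {"question": "Giải 2x + 1 = 5", "answer": "x = 2"})
validate_pending(conn, {"has_stem": lambda r: not r["question"].startswith("a)")})
print(failure_counts(conn))  # {'has_stem': 1}
```

With extracted rows stored this way, re-running changed validation logic amounts to resetting the status column to 'extracted' and calling the validator again, with no OCR or extraction calls repeated.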

Status

Code-Flow-v1 is retired. The 4,192 clean records now in the VietAlpha corpus were produced by a subsequent pipeline generation that addressed the checkpointing, serialization, and validation failures documented here.