Engineering Note Cadmus Pipeline Stage 4
Introducing Cadmus: VietAlpha's Pipeline for Vietnamese Math Reasoning
May 9, 2026 · Harry Tran
Cadmus is the pipeline responsible for pre-building the corpus that VietAlpha retrieves during tutoring sessions.
When a student submits a Grade 10 statistics problem, the retrieved worked examples that ground the tutoring response should come from verified, grade-tagged records that accurately represent the problem, its solution method, and the Vietnamese curriculum conventions that govern how the solution is expected to be presented. Whether those conditions hold depends on the data pipeline.
The Problem Behind the Data
Vietnamese high school mathematics is examined through a curriculum defined by the Ministry of Education and Training, administered across multiple exam formats, and sourced from materials that range from official textbooks to gifted-student competition papers to semester assessment sets. Each format has distinct conventions: a textbook problem may present its solution inline; a competition paper may provide only the final answer; a semester exam may include lettered sub-questions that belong to a shared stem. When these materials are digitized from scanned PDFs, OCR introduces its own error profile. Vietnamese diacritics are misencoded. LaTeX expressions split across line breaks. Problem numbers are read as content. Tables collapse into prose.
These errors occur in every batch. The question is whether the pipeline detects them before a corrupted or incomplete record reaches the retrieval index.
Wrong answers or incorrect grade attribution in retrieved examples mislead students and propagate systematic errors through the retrieval system.
What Cadmus Is
Cadmus is VietAlpha's official pipeline for converting Vietnamese mathematics source materials into structured, verified, curriculum-tagged records. It parses raw text (from textbooks, exams, and curriculum materials) into eight-field records whose fields include the question, answer choices, solution, hint, and grade level. Cadmus replaces an earlier internal prototype, which has now been retired.
The pipeline is designed around two architectural requirements that the earlier prototype lacked: resumability and auditability. Each of the eleven pipeline stages writes its output to a persistent directory that the next stage reads from. A run database records stage status and transition events. For every stage backed by language model calls, a replay ledger tracks which records have already been processed, so that a pipeline interrupted mid-stage can resume from where it stopped rather than starting over. A failure at one stage does not discard the work completed in prior stages.
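The resumption behavior described above can be illustrated with a minimal sketch. This is not Cadmus's implementation: the `ReplayLedger` class, the JSONL ledger format, the `record_id` key, and the `run_stage` helper are all hypothetical names chosen for illustration.

```python
import json
import tempfile
from pathlib import Path
from typing import Callable, Iterable


class ReplayLedger:
    """Append-only JSONL file of record IDs a stage has finished (hypothetical format)."""

    def __init__(self, path: Path) -> None:
        self.path = path
        self.done: set[str] = set()
        if path.exists():
            for line in path.read_text(encoding="utf-8").splitlines():
                if line.strip():
                    self.done.add(json.loads(line)["record_id"])

    def mark_done(self, record_id: str) -> None:
        # Persist the ID first, then update the in-memory set.
        with self.path.open("a", encoding="utf-8") as fh:
            fh.write(json.dumps({"record_id": record_id}) + "\n")
        self.done.add(record_id)


def run_stage(
    records: Iterable[dict],
    process: Callable[[dict], None],
    ledger: ReplayLedger,
) -> list[str]:
    """Process only records absent from the ledger; return the IDs handled in this call."""
    handled = []
    for record in records:
        if record["id"] in ledger.done:
            continue  # finished in an earlier, interrupted run
        process(record)  # if this raises, the record is not marked done
        ledger.mark_done(record["id"])
        handled.append(record["id"])
    return handled


def demo_resume() -> list[str]:
    """Simulate an interruption on record "c", then resume from the ledger on disk."""
    records = [{"id": "a"}, {"id": "b"}, {"id": "c"}]
    state = {"interrupted": False}

    def flaky_llm_call(record: dict) -> None:
        # Stand-in for a language model call that fails once on record "c".
        if record["id"] == "c" and not state["interrupted"]:
            state["interrupted"] = True
            raise RuntimeError("simulated interruption")

    with tempfile.TemporaryDirectory() as tmp:
        ledger_path = Path(tmp) / "ledger.jsonl"
        try:
            run_stage(records, flaky_llm_call, ReplayLedger(ledger_path))
        except RuntimeError:
            pass
        # A fresh ledger object reloads completed IDs, so only "c" is re-processed.
        return run_stage(records, flaky_llm_call, ReplayLedger(ledger_path))
```

In this sketch the ledger is written only after `process` returns, so an interruption can cause a record to be re-processed but never to be skipped.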
The Cadmus Flow
The pipeline runs eleven stages in sequence. Each stage has a defined responsibility and a defined output. A record that fails a stage is flagged explicitly.
The Cadmus pipeline separates each responsibility into a distinct stage so that failure points are identifiable and upstream work is not discarded when a later stage fails. Stages 6 and 7 use the same retrieval approach VietAlpha applies during tutoring sessions.
The most consequential stage is problem segmentation. Segmentation errors cascade: a chunk that joins two adjacent problems produces one double-length record; a chunk that splits a problem across a boundary produces a record missing either its stem or its solution. Both errors survive to the parsing stage and must be caught there.
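One way to surface both segmentation errors early is a marker-count heuristic, sketched below. The patterns are assumptions, not Cadmus's actual rules: the sketch supposes that problems open with "Câu N." or "Bài N:" at the start of a line and that solutions are introduced by "Lời giải". The function name and flag names are also hypothetical.

```python
import re

# Assumed conventions (not taken from the Cadmus codebase):
# a problem starts a line with "Câu 3." or "Bài 12:", and a solution
# is introduced by the phrase "Lời giải".
PROBLEM_MARKER = re.compile(r"^\s*(?:Câu|Bài)\s+\d+\s*[.:]", re.MULTILINE)
SOLUTION_MARKER = re.compile(r"Lời giải", re.IGNORECASE)


def segmentation_flags(chunk: str) -> list[str]:
    """Return heuristic flags for a chunk that may be merged or split."""
    flags = []
    n_problems = len(PROBLEM_MARKER.findall(chunk))
    has_solution = bool(SOLUTION_MARKER.search(chunk))
    if n_problems > 1:
        # More than one problem header suggests two problems were joined.
        flags.append("possible_merge")
    if n_problems == 0 and has_solution:
        # A solution with no problem header suggests the stem was cut off.
        flags.append("missing_stem")
    return flags
```

A chunk with a stem but no solution is deliberately not flagged here, because some sources, such as competition papers, provide only the final answer.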
Answer and solution completion in stage 6 is worth examining directly. Some records reach this stage with a missing answer, a missing solution, or both. Cadmus handles three distinct cases: records missing both the answer and solution, records with a solution but no final answer, and records with a final answer but no worked solution. For each incomplete record, it retrieves worked examples from a corpus of previously processed problems and uses that retrieved context alongside the question to generate a candidate answer or solution. The pipeline handles these cases before verification.
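The three cases can be made explicit with a small classifier. This is a sketch under assumptions: the enum, the function name, and the rule that a whitespace-only field counts as missing are illustrative choices, not details confirmed by the pipeline.

```python
from enum import Enum
from typing import Optional


class CompletionCase(Enum):
    COMPLETE = "complete"
    MISSING_BOTH = "missing_both"
    MISSING_ANSWER = "missing_answer"      # solution present, no final answer
    MISSING_SOLUTION = "missing_solution"  # final answer present, no worked solution


def _is_blank(value: Optional[str]) -> bool:
    # Assumption: None and whitespace-only strings both count as missing.
    return value is None or not value.strip()


def classify(answer: Optional[str], solution: Optional[str]) -> CompletionCase:
    """Map a record's answer and solution fields to one of the completion cases."""
    no_answer = _is_blank(answer)
    no_solution = _is_blank(solution)
    if no_answer and no_solution:
        return CompletionCase.MISSING_BOTH
    if no_answer:
        return CompletionCase.MISSING_ANSWER
    if no_solution:
        return CompletionCase.MISSING_SOLUTION
    return CompletionCase.COMPLETE
```

Routing on an explicit case value keeps the generation prompt for each case separate: a record with a solution but no answer needs only the final answer extracted or derived, while a record missing both needs a full worked solution.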
How This Differs From What Came Before
Cadmus differs from its predecessor in four specific architectural respects.
| Aspect | Earlier prototype | Cadmus |
|---|---|---|
| Intermediate state | No persistence between stages. Entire document representation passed as a string payload. A failure near the end discarded all upstream work. | Each stage writes to a persistent directory the next stage reads from. Interrupted runs are recoverable without losing prior work. |
| Validation design | A single component combined structural checks with mathematical correctness checks. Difficult to distinguish formatting failures from reasoning failures when output degraded. | Separate verifier and validator stages. The verifier checks mathematical consistency; the validator checks structural conformance. Failure type is identifiable. |
| LLM-backed stage resumption | No mechanism for tracking which records had already been submitted. Interrupted runs had no clean resumption path. | A replay ledger in every LLM-backed stage tracks completed records. Resumption re-processes only what was not yet finished. |
| Failure traceability | A record exiting with a null solution offered no history of which step removed or failed to complete it. | Run database records stage status and transition events. When a record exits with a null solution, the cause is identifiable. |
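The failure-traceability row can be illustrated with a minimal run-database sketch using Python's built-in `sqlite3` module. The table name `stage_events`, its columns, the status values, and the stage names in the usage note are hypothetical; the actual Cadmus schema is not described in this note.

```python
import sqlite3
from typing import Optional

# Hypothetical schema: one row per stage transition for a record.
SCHEMA = """
CREATE TABLE IF NOT EXISTS stage_events (
    seq       INTEGER PRIMARY KEY AUTOINCREMENT,
    record_id TEXT NOT NULL,
    stage     TEXT NOT NULL,
    status    TEXT NOT NULL
)
"""


def open_run_db(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) the run database and ensure the events table exists."""
    conn = sqlite3.connect(path)
    conn.execute(SCHEMA)
    return conn


def log_event(conn: sqlite3.Connection, record_id: str, stage: str, status: str) -> None:
    """Append one stage transition; the connection context manager commits it."""
    with conn:
        conn.execute(
            "INSERT INTO stage_events (record_id, stage, status) VALUES (?, ?, ?)",
            (record_id, stage, status),
        )


def first_flagged_stage(conn: sqlite3.Connection, record_id: str) -> Optional[str]:
    """Return the earliest stage that flagged the record, or None if none did."""
    row = conn.execute(
        "SELECT stage FROM stage_events "
        "WHERE record_id = ? AND status = 'flagged' "
        "ORDER BY seq LIMIT 1",
        (record_id,),
    ).fetchone()
    return row[0] if row else None
```

For example, after logging `("r1", "segmentation", "ok")` and then `("r1", "completion", "flagged")`, calling `first_flagged_stage(conn, "r1")` returns `"completion"`, which is the kind of lookup that explains a null solution.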
What Cadmus Currently Improves
Cadmus produces records with standardized schemas and verified grade attribution, keeping raw source text distinct from verified output. The staged architecture makes failures traceable: when a record exits with a null solution, the run history identifies which stage flagged it. Source category metadata also distinguishes competition-level materials from standard textbooks, allowing for weighted retrieval.
The impact of these structured records on retrieval quality and tutoring accuracy is currently being evaluated.
What Cadmus Does Not Yet Solve
Mathematical verification reduces error rates but does not eliminate them; a record can pass while containing a plausible but incorrect solution. OCR quality varies by source, and some formatting artifacts persist. The pipeline currently assigns grade and source category but lacks topic-level metadata, limiting retrieval precision. Some records exit with null fields because completion or verification failed.
Yield and Processing Efficiency
Benchmark metrics on the corpus (record counts, pass rates, and grade distribution) describe the pipeline's output characteristics across the processed dataset.
The 50.1% pass rate is the pipeline working as intended: the filter is strict by design, so only high-quality items reach the tagger. The high initial split multiplier (~60x) means every candidate surfaces before filtering begins. The net yield of 30.37 records per file, roughly the split multiplier times the pass rate, reflects both stages working across approximately 5,400 source documents.
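The relationship between these figures can be checked with simple arithmetic. The sketch below assumes that the split multiplier is the average number of candidates per source file and that the pass rate applies uniformly to those candidates; neither definition is stated in this note, and the function name is hypothetical.

```python
def net_yield_per_file(split_multiplier: float, pass_rate: float) -> float:
    """Expected passing records per source file, assuming a uniform pass rate."""
    return split_multiplier * pass_rate


# With the rounded multiplier of ~60 and the 50.1% pass rate:
approx_yield = net_yield_per_file(60.0, 0.501)

# Working backwards from the reported 30.37 figure gives the multiplier
# that this reading would imply.
implied_multiplier = 30.37 / 0.501
```

Under these assumptions, the rounded inputs give about 30.06 passing records per file, and the reported 30.37 figure corresponds to a split multiplier of about 60.6, which is consistent with the "~60x" description.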
These metrics describe what Cadmus produced. They do not describe whether students using VietAlpha find the retrieved explanations useful, whether the curriculum alignment translates to classroom accuracy, or whether the worked examples Cadmus generates are at the right level of difficulty.
Next Step
The current retrieval evaluation measures grade match rate and cosine similarity between a query and its top retrieved examples. Future evaluation must determine whether retrieved examples share the query's reasoning method.
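The two current metrics can be sketched as follows. The function names and the convention of returning 0.0 for empty inputs or zero vectors are assumptions made for illustration; the evaluation code itself is not shown in this note.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two embedding vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # assumed convention for zero vectors
    return dot / (norm_a * norm_b)


def grade_match_rate(query_grade: int, retrieved_grades: Sequence[int]) -> float:
    """Fraction of retrieved examples whose grade tag equals the query's grade."""
    if not retrieved_grades:
        return 0.0  # assumed convention when nothing is retrieved
    matches = sum(1 for grade in retrieved_grades if grade == query_grade)
    return matches / len(retrieved_grades)
```

For a Grade 10 query with retrieved grades `[10, 10, 11, 12]`, the grade match rate is 0.5. Neither metric inspects the solution method, so two examples can score well on both while solving the problem in different ways.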
Distinguishing surface similarity from reasoning similarity requires three things: manual relevance scoring by reviewers familiar with the Vietnamese math curriculum; grade-level error analysis; and case studies that identify whether failure patterns concentrate in particular source categories.