Engineering Note

Introducing Cadmus: VietAlpha's Pipeline for Vietnamese Math Reasoning

May 9, 2026 · Harry Tran

Cadmus is the pipeline that pre-builds the corpus VietAlpha retrieves from during tutoring sessions.

When a student submits a Grade 10 statistics problem, the retrieved worked examples that ground the tutoring response should come from verified, grade-tagged records that accurately represent the problem, its solution method, and the Vietnamese curriculum conventions that govern how the solution is expected to be presented. Whether those conditions hold depends on the data pipeline.

The Problem Behind the Data

Vietnamese high school mathematics is examined through a curriculum defined by the Ministry of Education and Training, administered across multiple exam formats, and sourced from materials that range from official textbooks to gifted-student competition papers to semester assessment sets. Each format has distinct conventions: a textbook problem may present its solution inline; a competition paper may provide only the final answer; a semester exam may include lettered sub-questions that belong to a shared stem. When these materials are digitized from scanned PDFs, OCR introduces its own error profile. Vietnamese diacritics are misencoded. LaTeX expressions split across line breaks. Problem numbers are read as content. Tables collapse into prose.

These errors occur in every batch. The question is whether the pipeline detects them before a corrupted or incomplete record reaches the retrieval index.

Why this is critical

Wrong answers or incorrect grade attribution in retrieved examples mislead students and propagate systematic errors through the retrieval system.

What Cadmus Is

Cadmus is VietAlpha's official pipeline for converting Vietnamese mathematics source materials into structured, verified, curriculum-tagged records. It parses raw text (from textbooks, exams, and curriculum materials) into eight-field records whose fields include the question, answer choices, solution, and hint. Grade level and source category are assigned in a later stage, which extends the schema to ten fields. Cadmus replaces an earlier internal prototype, which has now been retired.

The pipeline is designed around two architectural requirements not present in the earlier prototype: resumability and auditability. Each of the eleven pipeline stages writes its output to a persistent directory that the next stage reads from. A run database records stage status and transition events. For every stage backed by language model calls, a replay ledger tracks which records have already been processed, so that a pipeline interrupted mid-stage can resume from where it stopped rather than starting over. A failure at one stage does not discard the work completed by earlier stages.
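The resumption behavior described here can be sketched in a few lines. This is a minimal illustration, not the Cadmus implementation: the JSONL ledger format, the ReplayLedger class, the run_stage function, and the "id" and "record_id" keys are all assumptions made for the example.

```python
import json
from pathlib import Path
from typing import Callable, Iterable


class ReplayLedger:
    """Append-only log of record IDs a stage has finished (hypothetical format)."""

    def __init__(self, path: Path) -> None:
        self.path = path
        self.done: set[str] = set()
        if path.exists():
            with path.open(encoding="utf-8") as fh:
                for line in fh:
                    if line.strip():
                        self.done.add(json.loads(line)["record_id"])

    def mark_done(self, record_id: str) -> None:
        with self.path.open("a", encoding="utf-8") as fh:
            fh.write(json.dumps({"record_id": record_id}) + "\n")
        self.done.add(record_id)


def run_stage(records: Iterable[dict], ledger: ReplayLedger,
              process: Callable[[dict], dict], out_dir: Path) -> int:
    """Process only records missing from the ledger; return how many were run."""
    out_dir.mkdir(parents=True, exist_ok=True)
    processed = 0
    for record in records:
        record_id = record["id"]
        if record_id in ledger.done:
            continue  # finished during an earlier, interrupted run
        result = process(record)  # stands in for a language model call
        (out_dir / f"{record_id}.json").write_text(
            json.dumps(result, ensure_ascii=False), encoding="utf-8")
        ledger.mark_done(record_id)  # logged only after the output is on disk
        processed += 1
    return processed
```

Writing the ledger entry after the stage output means a crash between the two steps causes at most one record to be processed twice, which is usually cheaper than losing a finished record.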

The Cadmus Flow

The pipeline runs eleven stages in sequence. Each stage has a defined responsibility and a defined output. A record that fails a stage is flagged explicitly.

Pipeline stages - Cadmus v1

Source preparation

PDF to text via OCR tooling configured for Vietnamese mathematical content. GPU-aware sharding for high-throughput conversion across large document sets.

Surface cleaning

Strips stray formatting characters, broken paragraph joins, and PDF metadata tokens. Mathematical expressions in fence-delimited blocks are preserved intact.
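A rough sketch of cleaning that leaves math blocks untouched is shown below. It assumes math is fenced with $$ ... $$, which this note does not confirm, and it covers only stray characters, broken paragraph joins, and repeated spaces; PDF metadata tokens are left out because their form is not described here. All names are hypothetical.

```python
import re

# Assumption: math blocks are fenced with $$ ... $$; the actual fence may differ.
MATH_BLOCK = re.compile(r"(\$\$.*?\$\$)", re.DOTALL)
# Control characters plus zero-width space and byte-order mark.
STRAY_CHARS = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f\u200b\ufeff]")
# A newline that interrupts a sentence: no closing punctuation before it.
BROKEN_JOIN = re.compile(r"(?<=[^\s.!?:;])\n(\w)")


def _rejoin(match: re.Match) -> str:
    following = match.group(1)
    # Only merge when the next line starts in lowercase, i.e. mid-sentence.
    return " " + following if following.islower() else match.group(0)


def clean_surface(text: str) -> str:
    parts = MATH_BLOCK.split(text)  # odd indexes hold the captured math blocks
    cleaned = []
    for index, part in enumerate(parts):
        if index % 2 == 1:
            cleaned.append(part)  # leave math untouched
            continue
        part = STRAY_CHARS.sub("", part)
        part = BROKEN_JOIN.sub(_rejoin, part)
        part = re.sub(r"[ \t]{2,}", " ", part)
        cleaned.append(part)
    return "".join(cleaned)
```

Splitting on the fence first keeps every cleaning rule from ever seeing the inside of a math block, which is simpler to reason about than adding exceptions to each rule.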

Problem segmentation

Anchor scoring identifies problem boundaries in Vietnamese mathematical materials. An optional language model scouting pass handles cases that rule-based scoring misses.

Structured extraction

Each problem chunk is parsed into an eight-field structured record. The parser extracts what is present in the source without attempting to solve the problem.

Solvability filtering

Three passes in order: deterministic removal of empty records, fragments, and metadata; LLM solvability evaluation; and a repair pass for borderline records. In one observed run, 29% of records were removed at this stage.
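The ordering matters because the deterministic checks are cheap and the model call is not. The sketch below shows that ordering only; the patterns, the MIN_CHARS threshold, the "question" field name, and the llm_is_solvable callback are illustrative assumptions, and the repair pass for borderline records is omitted.

```python
import re
from typing import Callable, Optional

# Hypothetical patterns for page markers and exam headers.
METADATA_LINE = re.compile(r"^(trang|page)\s*\d+$|^đề thi\b|^mã đề\b", re.IGNORECASE)
MIN_CHARS = 15  # hypothetical cutoff below which text is treated as a fragment


def deterministic_reject_reason(question: str) -> Optional[str]:
    """Return why a record is rejected without a model call, or None to keep it."""
    text = question.strip()
    if not text:
        return "empty"
    if METADATA_LINE.match(text):
        return "metadata"
    if len(text) < MIN_CHARS:
        return "fragment"
    return None


def filter_records(records: list[dict],
                   llm_is_solvable: Callable[[dict], bool]) -> tuple[list[dict], list[dict]]:
    """Run cheap checks first; call the model only for records that survive them."""
    kept, removed = [], []
    for record in records:
        reason = deterministic_reject_reason(record.get("question", ""))
        if reason is None and not llm_is_solvable(record):
            reason = "unsolvable"
        target = removed if reason else kept
        target.append({**record, "reject_reason": reason})
    return kept, removed
```

Keeping the rejection reason on each removed record is what allows a later audit to separate OCR debris from genuinely unsolvable problems.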

Answer and solution completion

Records missing an answer, a solution, or both are completed using retrieved context from previously processed problems (the same retrieval approach used during tutoring sessions).

Mathematical verification

An independent stage checks whether the answer and solution are mathematically consistent with the question. Records failing verification can be sent to a resolver before quarantine.

Schema validation

Records are normalized against the eight-field schema. Minor structural inconsistencies from upstream handling are repaired before each record is confirmed.

Instructional hint generation

Each validated record receives a short Vietnamese-language hint pointing toward the correct method without stating the answer. Records that cannot produce a compliant hint are not passed forward.
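What counts as a compliant hint is not specified in this note. As one possible reading, the sketch below rejects hints that are empty, too long, or that contain the final answer verbatim; the function name, the word limit, and the matching rule are all assumptions.

```python
import re


def _normalize(text: str) -> str:
    """Lowercase and drop whitespace so 'x = 7,5' and 'x=7,5' compare equal."""
    return re.sub(r"\s+", "", text).lower()


def hint_is_compliant(hint: str, answer: str, max_words: int = 40) -> bool:
    """Hypothetical rule: short, non-empty, and not leaking the answer string."""
    if not hint.strip() or len(hint.split()) > max_words:
        return False
    answer_key = _normalize(answer)
    return not (answer_key and answer_key in _normalize(hint))
```

A verbatim check like this is deliberately crude: very short answers such as a single digit will match unrelated text in the hint, so a production rule would need something more careful.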

Curriculum metadata assignment

Grade level and source category are assigned. An English-language scan quarantines records where OCR has converted Vietnamese characters to visually similar Latin characters. The schema extends from eight to ten fields.
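One simple way to detect this kind of OCR damage is to measure how many letters still carry Vietnamese diacritics. The heuristic below is an assumption about how such a scan could work, not a description of the Cadmus scan, and the thresholds are placeholders.

```python
import unicodedata


def vietnamese_mark_ratio(text: str) -> float:
    """Share of letters that carry a diacritic or are the letter đ/Đ."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 0.0
    marked = 0
    for ch in letters:
        decomposed = unicodedata.normalize("NFD", ch)
        if ch in "đĐ" or any(unicodedata.combining(c) for c in decomposed):
            marked += 1
    return marked / len(letters)


def should_quarantine(text: str, min_letters: int = 20, min_ratio: float = 0.05) -> bool:
    """Flag text that is long enough to judge but has almost no Vietnamese marks."""
    letter_count = sum(ch.isalpha() for ch in text)
    return letter_count >= min_letters and vietnamese_mark_ratio(text) < min_ratio
```

Records dominated by formulas contain many unmarked Latin letters, so a scan of this kind would need to exclude math spans before counting.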

Dataset upload

Final artifacts are synced to the shared dataset store, organized by grade level.

The Cadmus pipeline separates each responsibility into a distinct stage so that failure points are identifiable and upstream work is not discarded when a later stage fails. Stages 6 and 7 use the same retrieval approach VietAlpha applies during tutoring sessions.

The most consequential stage is problem segmentation. Segmentation errors cascade: a chunk that joins two adjacent problems produces one double-length record; a chunk that splits a problem across a boundary produces a record missing either its stem or its solution. Both errors propagate to the parsing stage and must be caught there.
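To make the boundary problem concrete, here is a small anchor-scoring sketch. The anchor patterns, the weights, and the threshold are invented for illustration; the rules Cadmus actually uses are not given in this note.

```python
import re

# Hypothetical anchors and weights, chosen for illustration only.
ANCHORS = [
    (re.compile(r"^(Câu|Bài)\s+\d+\s*[.:)]"), 3.0),  # "Câu 1." or "Bài 2:"
    (re.compile(r"^Ví dụ\s+\d+"), 2.0),               # "Ví dụ 3"
    (re.compile(r"^\d+\s*[.)]\s+\S"), 1.0),           # bare "3) ..." is weak evidence
]
# Lettered sub-questions ("a)", "b.") belong to the current stem, not a new problem.
SUB_QUESTION = re.compile(r"^[a-d]\s*[.)]")


def score_line(line: str) -> float:
    """Score how strongly a line looks like the start of a new problem."""
    stripped = line.strip()
    if SUB_QUESTION.match(stripped):
        return -1.0
    return max((w for pattern, w in ANCHORS if pattern.match(stripped)), default=0.0)


def segment(text: str, threshold: float = 2.0) -> list[str]:
    """Start a new chunk at every line whose score reaches the threshold."""
    chunks: list[list[str]] = []
    for line in text.splitlines():
        if not chunks or score_line(line) >= threshold:
            chunks.append([])
        chunks[-1].append(line)
    return ["\n".join(chunk).strip() for chunk in chunks if any(s.strip() for s in chunk)]
```

With the threshold at 2.0, a bare numbered line such as "1) x = 2" stays inside the current problem, which avoids splitting a solution step from its stem at the cost of occasionally merging two weakly marked problems.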

Answer and solution completion in stage 6 is worth examining directly. Some records reach this stage with a missing answer, a missing solution, or both. Cadmus handles three distinct cases: records missing both the answer and solution, records with a solution but no final answer, and records with a final answer but no worked solution. For each incomplete record, it retrieves worked examples from a corpus of previously processed problems and uses that retrieved context alongside the question to generate a candidate answer or solution. The pipeline handles these cases before verification.
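The three cases can be told apart with a simple check on which fields are blank. The sketch below assumes the record stores them under "answer" and "solution"; those field names and the CompletionCase enum are hypothetical.

```python
from enum import Enum
from typing import Any


class CompletionCase(Enum):
    COMPLETE = "complete"                  # nothing to generate
    MISSING_BOTH = "missing_both"          # no answer and no solution
    MISSING_ANSWER = "missing_answer"      # solution present, no final answer
    MISSING_SOLUTION = "missing_solution"  # final answer present, no worked solution


def _is_blank(value: Any) -> bool:
    return value is None or not str(value).strip()


def classify(record: dict) -> CompletionCase:
    """Decide which completion case applies to a parsed record."""
    no_answer = _is_blank(record.get("answer"))
    no_solution = _is_blank(record.get("solution"))
    if no_answer and no_solution:
        return CompletionCase.MISSING_BOTH
    if no_answer:
        return CompletionCase.MISSING_ANSWER
    if no_solution:
        return CompletionCase.MISSING_SOLUTION
    return CompletionCase.COMPLETE
```

Separating the cases matters because each asks something different of the generator: deriving a final answer from an existing solution is a much narrower task than producing both from the question alone.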

How This Differs From What Came Before

Cadmus differs from its predecessor in four specific areas.

Aspect	Earlier prototype	Cadmus
Intermediate state	No persistence between stages. Entire document representation passed as a string payload. A failure near the end discarded all upstream work.	Each stage writes to a persistent directory the next stage reads from. Interrupted runs are recoverable without losing prior work.
Validation design	A single component combined structural checks with mathematical correctness checks. Difficult to distinguish formatting failures from reasoning failures when output degraded.	Separate verifier and validator stages. The verifier checks mathematical consistency; the validator checks structural conformance. Failure type is identifiable.
LLM-backed stage resumption	No mechanism for tracking which records had already been submitted. Interrupted runs had no clean resumption path.	A replay ledger in every LLM-backed stage tracks completed records. Resumption re-processes only what was not yet finished.
Failure traceability	A record exiting with a null solution offered no history of which step removed or failed to complete it.	Run database records stage status and transition events. When a record exits with a null solution, the cause is identifiable.
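The traceability row can be illustrated with a small event log. This sketch uses SQLite from the Python standard library; the table layout, the status values, and the stage names are assumptions, since the note does not describe the run database schema.

```python
import sqlite3
from typing import Optional

# Hypothetical schema: one row per stage transition of a record.
SCHEMA = """
CREATE TABLE IF NOT EXISTS stage_events (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    record_id TEXT NOT NULL,
    stage TEXT NOT NULL,
    status TEXT NOT NULL,
    detail TEXT
);
"""


def open_run_db(path: str = ":memory:") -> sqlite3.Connection:
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn


def log_event(conn: sqlite3.Connection, record_id: str, stage: str,
              status: str, detail: str = "") -> None:
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT INTO stage_events (record_id, stage, status, detail) "
            "VALUES (?, ?, ?, ?)",
            (record_id, stage, status, detail),
        )


def first_failure(conn: sqlite3.Connection, record_id: str) -> Optional[tuple]:
    """Return (stage, status, detail) for the earliest non-passing event, if any."""
    return conn.execute(
        "SELECT stage, status, detail FROM stage_events "
        "WHERE record_id = ? AND status != 'passed' ORDER BY id LIMIT 1",
        (record_id,),
    ).fetchone()
```

With events stored this way, asking why a record ended with a null solution becomes a single query against its history rather than a rerun of the pipeline.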

What Cadmus Currently Improves

Cadmus produces records with standardized schemas and verified grade attribution, keeping raw source text distinct from verified output. The staged architecture makes failures traceable: when a record exits with a null solution, the run history identifies which stage flagged it. Source category metadata also distinguishes competition-level materials from standard textbooks, allowing for weighted retrieval.

Important distinction

The impact of these structured records on retrieval quality and tutoring accuracy is currently being evaluated.

What Cadmus Does Not Yet Solve

Mathematical verification reduces error rates but does not eliminate them; a record can pass while containing a plausible but incorrect solution. OCR quality varies by source, and some formatting artifacts persist. The pipeline currently assigns grade and source category but lacks topic-level metadata, limiting retrieval precision. Some records exit with null fields because completion or verification failed.

Yield and Processing Efficiency

Benchmark metrics on the corpus (record counts, pass rates, and grade distribution) describe the pipeline's output characteristics across the processed dataset.

Stage	Output	Size	Step to next stage
Stage 1	Source files	5,400 documents	Split & Parse: ×59.85 average multiplier
Stage 2	Candidate pool	~60 candidates per file	Filter & Tag: 50.1% pass rate
Stage 3	Final questions	30.37% of total	

Observation

The 50.1% pass rate is the pipeline working as intended: the filter is strict by design, so only high-quality items reach the tagger. The high initial split multiplier (~60x) means segmentation surfaces candidates broadly and leaves rejection to the filter. The 30.37% net yield per file reflects both stages operating across approximately 5,400 source documents.
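The reported multiplier and pass rate can be combined into a back-of-the-envelope funnel. The calculation below uses only the rounded figures quoted above, so its outputs are estimates rather than pipeline counts, and it assumes the pass rate applies uniformly to all candidates. How the 30.37% figure is normalized is not stated in this note, so it is not used here.

```python
# Figures quoted in this note (rounded).
SOURCE_DOCUMENTS = 5_400
SPLIT_MULTIPLIER = 59.85   # average candidates produced per source document
FILTER_PASS_RATE = 0.501   # share of candidates that pass Filter & Tag


def funnel(documents: int, multiplier: float, pass_rate: float) -> dict:
    """Estimate candidate and pass counts from the summary figures."""
    candidates = documents * multiplier
    return {
        "candidates": candidates,
        "passed": candidates * pass_rate,
        "passed_per_document": multiplier * pass_rate,
    }


estimate = funnel(SOURCE_DOCUMENTS, SPLIT_MULTIPLIER, FILTER_PASS_RATE)
for name, value in estimate.items():
    print(f"{name}: {value:,.2f}")
```

Under these assumptions the pool is on the order of 323,000 candidates, with about 30 candidates per source document passing the filter.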

These metrics describe what Cadmus produced. They do not describe whether students using VietAlpha find the retrieved explanations useful, whether the curriculum alignment translates to classroom accuracy, or whether the worked examples Cadmus generates are at the right level of difficulty.

Next Step

The current retrieval evaluation measures grade match rate and cosine similarity between a query and its top retrieved examples. Future evaluation must determine whether retrieved examples share the query's reasoning method.
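For reference, the two current metrics can be computed as follows. This is a sketch under assumptions: the "embedding" and "grade" keys, the top-k cutoff, and averaging similarity over the top k are illustrative choices, not a description of VietAlpha's evaluation code.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # a zero vector has no direction to compare
    return dot / (norm_a * norm_b)


def evaluate_retrieval(query: dict, examples: list[dict], k: int = 3) -> dict:
    """Rank examples by cosine similarity, then score the top k."""
    scored = sorted(
        ((cosine_similarity(query["embedding"], ex["embedding"]), ex) for ex in examples),
        key=lambda pair: pair[0],
        reverse=True,
    )[:k]
    if not scored:
        return {"grade_match_rate": 0.0, "mean_cosine": 0.0}
    matches = sum(ex["grade"] == query["grade"] for _, ex in scored)
    return {
        "grade_match_rate": matches / len(scored),
        "mean_cosine": sum(sim for sim, _ in scored) / len(scored),
    }
```

Both numbers measure surface agreement: an example can match the grade and sit close in embedding space while using a different solution method, which is the gap the next evaluation step targets.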

Distinguishing surface similarity from reasoning similarity requires three things: manual relevance scoring by reviewers familiar with the Vietnamese math curriculum, grade-level error analysis, and case studies to identify whether failure patterns concentrate in particular source categories.