Research Note · Cadmus Pipeline · Stage 2
Local PDF Extraction for Vietnamese Math Content: A Feasibility Study
December 22–25, 2025 · Harry Tran
A PDF page can contain both a native text layer and a raster image of the same mathematical content. Standard extraction reads the text layer and silently ignores the image. The result is a record that appears complete but is missing part of the problem. This report documents a four-day experiment that tested whether a six-stage Python pipeline could produce reliable, memory-stable structured records from Vietnamese math PDFs on a single development laptop, and it identifies where the extraction breaks.
1. Mixed-Format PDFs and the Silent Miss
Vietnamese secondary math textbooks arrive as heterogeneous PDFs. A single file commonly mixes native digital text, scanned page images, embedded geometry diagrams, and mathematical notation rendered in raster, vector, or both. Standard PDF text extraction reads the document's internal character stream. It handles native text correctly. On image-rendered content, it produces no output. The record is simply absent.
This creates two compounding problems. First, pages with scanned math content are silently excluded unless the pipeline explicitly detects and routes them. Second, LaTeX notation extracted from either source type is frequently broken after OCR: missing braces, Unicode substitution for math operators, truncated macro names. A formula that appears to be LaTeX but is syntactically invalid cannot be rendered by a math parser and is unusable as training data.
2. Why Silent Gaps Are Worse Than Crashes
A dataset with silent extraction failures has an unknown coverage rate. Records that appear structurally valid may be missing content: a problem statement without its associated diagram, or a formula with numerator and denominator transposed by OCR. This is harder to detect than an outright pipeline crash, because downstream processes receive a well-formed record with no indication that content is absent.
For VietAlpha, each such record becomes a training example that points toward a wrong answer. The central research question was narrow: can the extraction process be made reliable and memory-stable enough to run unattended at scale on standard hardware? Answering that question here, on a small sample with no cloud costs, is far less expensive than answering it during a production run.
3. Building the Pipeline
The pipeline was built incrementally over four days, each stage added to address a specific failure mode in the previous run. The result is a six-stage hybrid system.
3.1 Initial approach: uniform rasterization
The first implementation treated every page as an image, regardless of whether the underlying PDF contained extractable text. Each page was rendered to a bitmap at 300 DPI using pdf2image (a poppler wrapper), then passed to Tesseract with the Vietnamese language pack for OCR.
The intent was robustness: by treating everything as an image, the pipeline could handle both native text and scanned pages without a detection step. In practice, this was too heavy for the hardware. Rasterizing a 200-page book at 300 DPI produced approximately 2 GB of pixel data. The laptop overheated within minutes, and processing a single file took eight minutes - too slow for a corpus of thousands of pages.
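The 2 GB figure is consistent with a back-of-envelope estimate. The sketch below assumes A4 pages and 8-bit grayscale bitmaps; neither is stated in the measurements above, and the function name is illustrative. Rendering in RGB would triple the result.

```python
# Rough estimate of uncompressed bitmap size for a fully rasterized book.
# Assumptions (not taken from the experiment): A4 pages, 8-bit grayscale.
A4_INCHES = (8.27, 11.69)  # page width and height in inches


def raster_bytes(pages: int, dpi: int = 300, bytes_per_pixel: int = 1) -> int:
    """Return the total uncompressed pixel data for `pages` A4 pages."""
    width_px = round(A4_INCHES[0] * dpi)
    height_px = round(A4_INCHES[1] * dpi)
    return pages * width_px * height_px * bytes_per_pixel


total = raster_bytes(pages=200)
print(f"{total / 1e9:.2f} GB of pixel data for 200 pages at 300 DPI")
```

Under these assumptions a 200-page book comes to roughly 1.7 GB, in line with the figure observed.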
3.2 Hybrid extraction with AI verification
The second version introduced two extraction paths. Pages with extractable text layers were handled by PyMuPDF's native extraction, which reads directly from the PDF character stream without rasterization. Pages yielding fewer than 10 characters from native extraction were treated as image-dominant and sent to Tesseract.
A lightweight Gemini model (Flash tier) was added as a structural check on OCR output. The model scanned each page's raw text for structural anomalies - table headers merged into paragraphs, formulas split across lines - and flagged them without rewriting the content. This kept verification fast and prevented the model from producing text not present in the source.
3.3 Per-document error isolation
As the pipeline grew more complex, single-document failures began crashing entire processing runs. A corrupted PDF - one with a missing page tree or an encoding error - propagated an exception that terminated the script. Overnight runs lost hours of work to a single bad file.
The codebase was refactored so each document is processed within its own exception boundary. On failure, the document path, page number, and exception class are written to a structured error log and the script moves to the next file without retry. This made unattended overnight runs stable.
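A minimal sketch of such an exception boundary follows. The names `run_isolated`, `PageError`, and `process_document` are hypothetical stand-ins, and the JSON-lines log format is an assumption; the note specifies only which fields are logged.

```python
import json
from typing import Callable, Iterable, List


class PageError(Exception):
    """Hypothetical wrapper that records which page a failure occurred on."""

    def __init__(self, page: int, cause: Exception):
        super().__init__(f"page {page}: {cause}")
        self.page = page
        self.cause = cause


def run_isolated(
    paths: Iterable[str],
    process_document: Callable[[str], None],
    error_log: str,
) -> List[str]:
    """Process each document inside its own exception boundary."""
    succeeded = []
    with open(error_log, "a", encoding="utf-8") as log:
        for path in paths:
            try:
                process_document(path)
                succeeded.append(path)
            except Exception as exc:
                # Log the underlying exception class, not the wrapper.
                cause = getattr(exc, "cause", exc)
                entry = {
                    "path": str(path),
                    "page": getattr(exc, "page", None),
                    "exception": type(cause).__name__,
                }
                log.write(json.dumps(entry) + "\n")
                # No retry: fall through to the next file.
    return succeeded
```

Catching `Exception` rather than `BaseException` leaves Ctrl-C and interpreter shutdown able to stop an overnight run.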
3.4 Memory-bounded batch processing
Processing files sequentially without releasing memory caused RAM usage to climb until the system stalled. PyMuPDF and PIL objects tend to persist in memory after a file is finished, even when no longer referenced by application code.
Documents were grouped into fixed-size batches of five. After each batch completed, output was serialized to disk, all document handles were explicitly closed, image objects were deleted, and gc.collect() was called to force a full garbage collection cycle before the next batch loaded. This kept peak memory use bounded to approximately five documents' worth of bitmap and text data, rather than growing with corpus size.
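The batch loop can be sketched roughly as follows. `process_document` and `write_batch` are hypothetical callables; the sketch assumes `process_document` closes its own document handles and deletes its image objects before returning (for example, through a context manager).

```python
import gc
from typing import Any, Callable, Iterator, List, Sequence


def batched(items: Sequence[str], size: int = 5) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def run_in_batches(
    paths: Sequence[str],
    process_document: Callable[[str], Any],
    write_batch: Callable[[List[Any]], None],
    batch_size: int = 5,
) -> int:
    """Process documents in fixed-size batches, flushing memory between them."""
    batches_done = 0
    for batch in batched(paths, batch_size):
        records = [process_document(path) for path in batch]
        write_batch(records)  # serialize to disk before releasing anything
        del records           # drop the only reference to this batch's output
        gc.collect()          # full collection before the next batch loads
        batches_done += 1
    return batches_done
```

Because nothing from a finished batch is still referenced when the next one starts, peak memory depends on the batch size, not on the corpus size.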
3.5 LaTeX integrity validation
OCR frequently corrupted LaTeX notation. Common errors included unclosed braces (\frac{a}{b), Unicode symbols replacing math operators (degree symbols in place of \circ), and backslashes rendered as pipes or forward slashes.
A validation gate was added before records were written to the output store. The gate applied three checks: brace balance via a stack counter, detection of known OCR substitution patterns via regular expressions, and presence checks for truncated standard macro names. Records failing validation were written to a separate flagged queue for manual review, kept apart from the clean output.
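The three checks might look like the sketch below. The pattern list and macro set are small illustrative assumptions, not the lists used in the experiment, and `validate_latex` is a hypothetical name.

```python
import re
from typing import List

# Illustrative patterns only; the experiment's actual lists are not documented.
OCR_SUBSTITUTIONS = [
    re.compile(r"°"),                         # degree sign where \circ was expected
    re.compile(r"[|/](?:frac|sqrt|circ)\b"),  # backslash misread as pipe or slash
]
KNOWN_MACROS = {"frac", "sqrt", "circ", "cdot", "times", "left", "right", "le", "ge"}


def braces_balanced(expr: str) -> bool:
    """Check brace balance with a depth counter, skipping escaped characters."""
    depth = 0
    i = 0
    while i < len(expr):
        ch = expr[i]
        if ch == "\\":
            i += 2  # skip the backslash and the character it escapes
            continue
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth < 0:
                return False
        i += 1
    return depth == 0


def validate_latex(expr: str) -> List[str]:
    """Return the list of failed checks; an empty list means the record is clean."""
    problems = []
    if not braces_balanced(expr):
        problems.append("unbalanced_braces")
    if any(pattern.search(expr) for pattern in OCR_SUBSTITUTIONS):
        problems.append("ocr_substitution")
    for name in re.findall(r"\\([A-Za-z]+)", expr):
        # A macro is treated as truncated if it is a proper prefix of a known one.
        if name not in KNOWN_MACROS and any(m.startswith(name) for m in KNOWN_MACROS):
            problems.append("truncated_macro")
            break
    return problems
```

A record goes to the clean output when the returned list is empty and to the flagged queue otherwise. The checks are purely syntactic: `\frac{b}{a}` passes just as `\frac{a}{b}` does.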
3.6 Heuristic fast/slow routing
Even with selective rasterization, processing time remained dominated by pages sent to Tesseract. A routing gate was added before the extraction step to classify each page into a fast lane (native extraction only) or slow lane (rasterization, OCR, and verification).
The gate used two signals: the character count from PyMuPDF's direct extraction (a page with fewer than 10 characters was treated as image-dominant) and the ratio of image object area to total page area, computed from the PDF's embedded image metadata. Pages that met the character threshold and stayed under the image-area threshold took the fast lane. Pages that failed either test took the slow lane.
This reduced Tesseract calls to only pages that required them, cutting median per-document processing time from approximately eight minutes to approximately ninety seconds.
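The gate's decision rule can be sketched as a pure function over the two signals. This reading assumes the image-area signal acts as an upper bound (a page must stay at or below a maximum ratio to take the fast lane). The value `MAX_IMAGE_RATIO = 0.5` is a placeholder, since the note does not state the image-area threshold, and `PageSignals` and `route` are hypothetical names.

```python
from dataclasses import dataclass

MIN_CHARS = 10         # from the text: fewer than 10 characters means image-dominant
MAX_IMAGE_RATIO = 0.5  # placeholder assumption; the real threshold is not stated


@dataclass
class PageSignals:
    char_count: int          # characters returned by native extraction
    image_area_ratio: float  # embedded image area divided by page area


def route(
    page: PageSignals,
    min_chars: int = MIN_CHARS,
    max_image_ratio: float = MAX_IMAGE_RATIO,
) -> str:
    """Return "fast" for native extraction only, "slow" for rasterize + OCR."""
    has_text = page.char_count >= min_chars
    few_images = page.image_area_ratio <= max_image_ratio
    return "fast" if has_text and few_images else "slow"


print(route(PageSignals(char_count=340, image_area_ratio=0.31)))
print(route(PageSignals(char_count=4, image_area_ratio=0.90)))
```

Under these assumed thresholds, a text-heavy page with a diagram covering about a third of its area still takes the fast lane, which is the situation examined in Section 6.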
4. Pipeline Summary
The pipeline as it existed at the end of the experiment had six stages, followed by an output step:
Stage 1, error isolation: per-document exception boundary; log error class and path, then continue
Stage 2, batching: 5-document batches; serialize and gc.collect() between batches
Stage 3, routing: character density + image-area ratio → fast or slow lane
Stage 4, native extraction (fast lane): character stream read on text-layer pages
Stage 5, OCR and verification (slow lane): 300 DPI bitmap, vie language pack, Flash-tier structural review
Stage 6, LaTeX gate: brace balance, OCR substitution patterns, truncated macros
Output: clean records to JSONL; flagged records to review queue
5. Benchmarks
The table below compares the three main configurations tested during the experiment. Each was run on the same hardware against the same document sample.
| Metric | Full rasterization | Hybrid fast/slow routing | Native extraction only |
|---|---|---|---|
| Median per-document time (single document, ~100 pages) | ~8 min | ~90 sec | ~12 sec |
| Image-dominant page coverage (pages with no native text layer) | Full | Full | None (silent miss) |
| LaTeX extraction reliability (formulas structurally intact post-extraction) | Low (no gate) | High (gated) | High on native; zero on image pages |
| Memory stability (unattended overnight runs) | Crashes at scale | Stable (batched gc) | Stable |
| AI verification benefit (structural anomaly detection) | N/A | Effective on moderate noise; degrades on heavy artifacts | N/A |
6. A Grade 10 Page That Looked Fully Extracted
During extraction of a grade 10 algebra document, the pipeline processed a page containing a worked example for solving the quadratic equation x² - 5x + 6 = 0. The page had two content zones: a native text layer carrying the problem statement and algebraic solution steps, and a raster image carrying a labeled number line marking the two roots, x = 2 and x = 3.
The page returned 340 characters from native extraction, placing it above the 10-character fast-lane threshold. The image-area ratio was 0.31 - the diagram occupied roughly a third of the page. Because the character density signal was high, the routing gate sent the page to the fast lane. The algebraic content extracted correctly, including the step-by-step factoring and root identification. The number line diagram produced no output.
The resulting record contained a complete algebraic worked example with no geometric component. It passed the LaTeX integrity gate, because the algebraic content was syntactically valid. No pipeline stage flagged the record as incomplete. Without ground-truth comparison, this record is indistinguishable from one that was fully extracted.
This case illustrates the primary failure mode of the routing heuristic: pages where high character density and significant image area co-occur are likely to be sent to the fast lane, with image content silently dropped.
7. Stable and Fast: Quality Still Depends on the Source
The hybrid configuration is viable for unattended processing on standard hardware. Memory remains bounded because each batch is flushed explicitly before the next loads. Per-document time is under two minutes on documents of typical length, which indicates the pipeline can handle a large corpus in reasonable wall-clock time without GPU acceleration.
These results cover pipeline stability only. The pipeline completed, wrote records, and did not crash during overnight runs. Whether those records accurately represent the source content depends on source document quality and the routing gate's classification accuracy, neither of which was formally evaluated in this experiment.
8. Three Failure Modes That Remain Open
8.1 Verification cannot fix bad scans
The AI verification step was designed to detect structural anomalies in OCR output - merged paragraphs, formulas broken across lines - without rewriting content. When Tesseract processed pages with heavy blur, binding shadow, or significant page skew, the output was too degraded for the model to identify any structural pattern, and it returned only low-confidence flags. The verification step provides useful signal on moderate noise; it cannot compensate for poor-quality source scans. These pages likely need deskewing and contrast normalization before OCR runs.
8.2 The routing gate misclassifies edge cases
The routing signals are heuristics set by inspection, with no calibration against labeled data. A page with a single short problem statement above a large diagram may fall below the character threshold and be sent to the slow lane unnecessarily. A page with dense native text and a diagram at moderate area - as in the example above - may be sent to the fast lane with image content silently missed. The thresholds' error rate on the full corpus is not known. Calibrating them would require a labeled sample of pages with known extraction ground truth.
8.3 The LaTeX gate checks syntax, not mathematical meaning
The integrity gate reliably caught syntactically broken expressions: unclosed braces, truncated macro names, and known OCR substitution patterns. A formula with correct brace balance and intact macro names passes every check even when its mathematical content is wrong: a fraction with numerator and denominator transposed by OCR clears all three heuristics and enters the clean output as an incorrect record. Detecting this class of error requires a math-aware parser or a model capable of interpreting mathematical meaning, neither of which was in scope here.
9. What This Experiment Did Not Measure
The test corpus was a convenience sample of available PDFs, with no stratification by grade level, publisher, or page-type distribution. Performance figures may not generalize to the full textbook corpus.
Routing thresholds were set by inspection, with no optimization against a labeled set. The false-positive and false-negative rates of the page classification gate on the full corpus are not known.
Processing time figures reflect a single hardware configuration - Intel Core processor, 16 GB RAM, no GPU. They will vary with document complexity and the proportion of slow-lane pages in the corpus.
The flagged-record review queue was not evaluated during this experiment. The fraction of flagged records that are recoverable versus unrecoverable from source degradation is unknown.
No semantic validation of mathematical content was performed. The clean output may contain records with valid LaTeX syntax that encode mathematically incorrect content.
10. Two Prerequisites Before a Production Run
Before a production pipeline run, two things are required. First, the routing gate thresholds need calibration against a manually labeled sample of 100 to 200 pages with known ground-truth extraction - one that includes pages where high character count and moderate image area co-occur, as in the example above. Without this, the gate's error rate on content-rich mixed pages remains uncharacterized. Second, pages routed to the slow lane that fail Tesseract quality checks need image preprocessing - deskewing and contrast normalization - applied before OCR runs. Without this step, the verification stage receives input too degraded to analyze and returns only low-confidence flags.
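As a sketch of what the calibration step could measure, the function below scores a candidate pair of thresholds against a labeled sample. The `needs_ocr` label, the default image-ratio threshold of 0.5, and all names are assumptions for illustration; the decision rule mirrors the fast-lane reading of Section 3.6 (enough characters and an image-area ratio at or below a maximum).

```python
from typing import Dict, Iterable, NamedTuple


class LabeledPage(NamedTuple):
    char_count: int
    image_area_ratio: float
    needs_ocr: bool  # ground truth from manual review (hypothetical label)


def gate_error_rates(
    pages: Iterable[LabeledPage],
    min_chars: int = 10,
    max_image_ratio: float = 0.5,  # placeholder; the real threshold is not stated
) -> Dict[str, float]:
    """Compute the two error rates of the routing gate on a labeled sample."""
    missed = needs = unnecessary = no_need = 0
    for page in pages:
        fast = (
            page.char_count >= min_chars
            and page.image_area_ratio <= max_image_ratio
        )
        if page.needs_ocr:
            needs += 1
            missed += fast            # needed OCR but took the fast lane
        else:
            no_need += 1
            unnecessary += not fast   # took the slow lane without needing it
    return {
        "silent_miss_rate": missed / needs if needs else 0.0,
        "unnecessary_slow_rate": unnecessary / no_need if no_need else 0.0,
    }
```

Running this over a grid of candidate thresholds on the 100 to 200 labeled pages would show how the silent-miss rate trades off against unnecessary OCR time.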
Footnotes
1 PyMuPDF (fitz) version 1.23.x was used for native text extraction. Tesseract version 5.3.x with the vie trained data for Vietnamese was used for OCR on image-dominant pages. Gemini Flash (December 2025 version) was the verification model.
2 The LaTeX validation heuristics documented in Section 3.5 are specific to Tesseract's observed failure modes on Vietnamese math content. They are not a general LaTeX validator. A comprehensive validator would require a full TeX parser.
3 Performance figures in Section 5 are approximate measurements from timed test runs. Hardware configuration: Intel Core processor, 16 GB RAM, no GPU acceleration for OCR workloads. Exact figures will vary with document complexity and page-type distribution.
December 25, 2025 - End of Stage 2. Stage 3 begins production pipeline build.