Research Note: Cadmus Pipeline, Stage 2

Local PDF Extraction for Vietnamese Math Content: A Feasibility Study

A PDF page can contain both a native text layer and a raster image that each carry part of the same math problem. Standard extraction reads the text layer and silently ignores the image. The result is a record that appears complete but is missing part of the problem. This report documents a four-day experiment that tested whether a six-stage Python pipeline could produce reliable, memory-stable structured records from Vietnamese math PDFs on a single development laptop, and that identified where the extraction breaks.

1. Mixed-Format PDFs and the Silent Miss

Vietnamese secondary math textbooks arrive as heterogeneous PDFs. A single file commonly mixes native digital text, scanned page images, embedded geometry diagrams, and mathematical notation rendered in raster, vector, or both. Standard PDF text extraction reads the document's internal character stream. It handles native text correctly. On image-rendered content, it produces no output. The record is simply absent.

This creates two compounding problems. First, pages with scanned math content are silently excluded unless the pipeline explicitly detects and routes them. Second, LaTeX notation extracted from either source type is frequently broken after OCR: missing braces, Unicode substitution for math operators, truncated macro names. A formula that appears to be LaTeX but is syntactically invalid cannot be rendered by a math parser and is unusable as training data.

Scope
Input documents contained mixed-format pages with native text, scanned image layers, mathematical diagrams, and LaTeX-style formula notation. The extraction target was a structured record per problem: question text, formula fields, solution steps, and answer label - all in Vietnamese with LaTeX notation preserved.
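
A minimal sketch of what one such record might look like, assuming Python dataclasses and one JSON object per output line. The field names (question_text, formulas, solution_steps, answer_label) are hypothetical and only mirror the four parts listed above; this note does not specify the actual schema.

```python
import json
from dataclasses import dataclass, field, asdict


@dataclass
class ProblemRecord:
    """One extracted problem. Field names are illustrative assumptions."""
    question_text: str
    formulas: list[str] = field(default_factory=list)
    solution_steps: list[str] = field(default_factory=list)
    answer_label: str = ""


def to_jsonl_line(record: ProblemRecord) -> str:
    # ensure_ascii=False keeps Vietnamese diacritics readable in the file.
    # json.dumps escapes backslashes, so LaTeX survives a round trip.
    return json.dumps(asdict(record), ensure_ascii=False)


example = ProblemRecord(
    question_text="Giải phương trình sau.",
    formulas=[r"x^2 - 5x + 6 = 0"],
    solution_steps=[r"(x - 2)(x - 3) = 0", r"x = \frac{5 \pm 1}{2}"],
    answer_label="x = 2 hoặc x = 3",
)
line = to_jsonl_line(example)
```

Loading the line back with json.loads should return the same strings, including the backslashes in the LaTeX fields.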

2. Why Silent Gaps Are Worse Than Crashes

A dataset with silent extraction failures has an unknown coverage rate. Records that appear structurally valid may be missing content: a problem statement without its associated diagram, or a formula with numerator and denominator transposed by OCR. This is harder to detect than an outright pipeline crash, because downstream processes receive a well-formed record with no indication that content is absent.

For VietAlpha, each such record becomes a training example that points toward a wrong answer. The central research question was narrow: can the extraction process be made reliable and memory-stable enough to run unattended at scale on standard hardware? Discovering that answer here, on a small sample with no cloud costs, is far less expensive than discovering it during a production run.

3. Building the Pipeline

The pipeline was built incrementally over four days, each stage added to address a specific failure mode in the previous run. The result is a six-stage hybrid system.

3.1 Initial approach: uniform rasterization

The first implementation treated every page as an image, regardless of whether the underlying PDF contained extractable text. Each page was rendered to a bitmap at 300 DPI using pdf2image (a poppler wrapper), then passed to Tesseract with the Vietnamese language pack for OCR.

The intent was robustness: by treating everything as an image, the pipeline could handle both native text and scanned pages without a detection step. In practice, this was too heavy for the hardware. Rasterizing a 200-page book at 300 DPI produced approximately 2 GB of pixel data. The laptop overheated within minutes, and processing a single file took eight minutes - too slow for a corpus of thousands of pages.
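
The 2 GB figure can be sanity-checked with a short calculation. The sketch below assumes A4 pages (8.27 x 11.69 inches) and uncompressed bitmaps; neither the page size nor the color depth is stated in this note, so both are assumptions.

```python
def raster_bytes(pages: int, dpi: int, width_in: float, height_in: float,
                 bytes_per_pixel: int) -> int:
    """Uncompressed bitmap size for a run of equally sized pages."""
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    return pages * width_px * height_px * bytes_per_pixel


# Assumed page size: A4 in inches.
A4_WIDTH_IN, A4_HEIGHT_IN = 8.27, 11.69

gray = raster_bytes(200, 300, A4_WIDTH_IN, A4_HEIGHT_IN, bytes_per_pixel=1)
rgb = raster_bytes(200, 300, A4_WIDTH_IN, A4_HEIGHT_IN, bytes_per_pixel=3)
print(f"grayscale: {gray / 1e9:.2f} GB, RGB: {rgb / 1e9:.2f} GB")
# prints "grayscale: 1.74 GB, RGB: 5.22 GB"
```

Under these assumptions, 8-bit grayscale gives roughly 1.7 GB for 200 pages, in line with the approximate figure above. Full-color rendering would be about three times larger.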

3.2 Hybrid extraction with AI verification

The second version introduced two extraction paths. Pages with extractable text layers were handled by PyMuPDF's native extraction, which reads directly from the PDF character stream without rasterization. Pages yielding fewer than 10 characters from native extraction were treated as image-dominant and sent to Tesseract.
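
The split described above could be sketched as follows. The page objects are assumed to behave like PyMuPDF Page objects, whose get_text() returns plain text by default. Stripping whitespace before counting is an assumption; the note does not say how characters were counted.

```python
MIN_NATIVE_CHARS = 10  # threshold stated in the text above


def native_text(page) -> str:
    """Return the page's text layer (expects a PyMuPDF-like page object)."""
    return page.get_text()


def is_image_dominant(text: str, min_chars: int = MIN_NATIVE_CHARS) -> bool:
    # Whitespace is stripped so a page of blank lines does not count as text.
    return len(text.strip()) < min_chars


def split_pages(pages) -> tuple[list[int], list[int]]:
    """Return (native_page_indexes, ocr_page_indexes) for an iterable of pages."""
    native, ocr = [], []
    for index, page in enumerate(pages):
        if is_image_dominant(native_text(page)):
            ocr.append(index)      # too little text: send to Tesseract
        else:
            native.append(index)   # usable text layer: keep native extraction
    return native, ocr
```

With PyMuPDF installed, an opened document can be iterated page by page, so passing the document object to split_pages should work without further glue code.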

A lightweight Gemini model (Flash tier) was added as a structural check on OCR output. The model scanned each page's raw text for structural anomalies - table headers merged into paragraphs, formulas split across lines - and flagged them without rewriting the content. This kept verification fast and prevented the model from producing text not present in the source.

Design Note
The verifier flags structural errors without correcting them. When OCR output was too degraded for the model to identify structure, it returned only low-confidence flags and could not pinpoint specific anomalies. This pointed toward a hard limit: the verification step can identify problems in moderate noise, but it cannot recover content from fundamentally poor source scans.

3.3 Per-document error isolation

As the pipeline grew more complex, single-document failures began crashing entire processing runs. A corrupted PDF - one with a missing page tree or an encoding error - propagated an exception that terminated the script. Overnight runs lost hours of work to a single bad file.

The codebase was refactored so each document is processed within its own exception boundary. On failure, the document path, page number, and exception class are written to a structured error log and the script moves to the next file without retry. This made unattended overnight runs stable.
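
A minimal sketch of such an exception boundary, using only the standard library. The function name run_isolated, the JSON-lines log format, and the convention that the processing function attaches a page_number attribute to the exceptions it raises are all assumptions made for illustration.

```python
import json
from pathlib import Path
from typing import Callable, Iterable


def run_isolated(paths: Iterable[Path],
                 process: Callable[[Path], None],
                 error_log: Path) -> int:
    """Process each document inside its own exception boundary.

    Returns the number of failed documents. Each failure is appended to
    `error_log` as one JSON object per line. There is no retry.
    """
    failures = 0
    with error_log.open("a", encoding="utf-8") as log:
        for path in paths:
            try:
                process(path)
            except Exception as exc:  # catch-all on purpose: one bad file must not stop the run
                failures += 1
                entry = {
                    "path": str(path),
                    # Hypothetical convention: `process` may set `page_number`
                    # on the exception; None means the page is unknown.
                    "page": getattr(exc, "page_number", None),
                    "error_class": type(exc).__name__,
                }
                log.write(json.dumps(entry, ensure_ascii=False) + "\n")
    return failures
```

Catching Exception rather than BaseException leaves KeyboardInterrupt untouched, so an operator can still stop an overnight run by hand.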

3.4 Memory-bounded batch processing

Processing files sequentially without releasing memory caused RAM usage to climb until the system stalled. PyMuPDF and PIL objects tend to persist in memory after a file is finished, even when no longer referenced by application code.

Documents were grouped into fixed-size batches of five. After each batch completed, output was serialized to disk, all document handles were explicitly closed, image objects were deleted, and gc.collect() was called to force a full garbage collection cycle before the next batch loaded. This kept peak memory use bounded to approximately five documents' worth of bitmap and text data, rather than growing with corpus size.
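
The batching loop might look like the sketch below. The callables process_batch and write_records are placeholders for the pipeline's own extraction and serialization code, which this note does not show. The sketch assumes process_batch closes its own document handles before returning.

```python
import gc
from itertools import islice
from typing import Callable, Iterable, Iterator, TypeVar

T = TypeVar("T")
BATCH_SIZE = 5  # batch size stated in the text above


def batched(items: Iterable[T], size: int) -> Iterator[list[T]]:
    """Yield consecutive lists of at most `size` items."""
    iterator = iter(items)
    while batch := list(islice(iterator, size)):
        yield batch


def run_in_batches(paths: Iterable[str],
                   process_batch: Callable[[list[str]], list[dict]],
                   write_records: Callable[[list[dict]], None],
                   size: int = BATCH_SIZE) -> int:
    """Process documents batch by batch, releasing memory between batches."""
    total = 0
    for batch in batched(paths, size):
        records = process_batch(batch)  # assumed to close its document handles
        write_records(records)          # serialize before dropping references
        total += len(records)
        del records, batch              # drop the references held by this loop
        gc.collect()                    # force a full collection before the next batch
    return total
```

Because batched is a generator, only one batch of paths and records is referenced at a time, which is what keeps peak memory tied to the batch size.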

3.5 LaTeX integrity validation

OCR frequently corrupted LaTeX notation. Common errors included unclosed braces (\frac{a}{b), Unicode symbols replacing math operators (degree symbols in place of \circ), and backslashes rendered as pipes or forward slashes.

A validation gate was added before records were written to the output store. The gate applied three checks: brace balance via a stack counter, detection of known OCR substitution patterns via regular expressions, and presence checks for truncated standard macro names. Records failing validation were written to a separate flagged queue for manual review, kept apart from the clean output.
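
The three checks could be sketched as below. The regular expressions and the macro whitelist are illustrative assumptions; the note does not list the actual patterns used. The brace counter also ignores escaped braces such as \{, which a fuller check would need to handle.

```python
import re

# Illustrative patterns only; not the expressions used in the experiment.
OCR_SUBSTITUTIONS = [
    re.compile(r"\^\{?\s*°"),                 # degree sign where \circ was expected
    re.compile(r"[|/](?:frac|sqrt|circ)\b"),  # backslash read as a pipe or slash
]
# A small, assumed whitelist of macros expected in the corpus.
KNOWN_MACROS = {"frac", "sqrt", "circ", "cdot", "times", "le", "ge", "ne",
                "pm", "left", "right", "Delta", "alpha", "beta", "pi"}
MACRO = re.compile(r"\\([A-Za-z]+)")


def braces_balanced(formula: str) -> bool:
    depth = 0
    for ch in formula:
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth < 0:  # closing brace with nothing open
                return False
    return depth == 0


def validate_formula(formula: str) -> list[str]:
    """Return the names of failed checks; an empty list means the formula passes."""
    problems = []
    if not braces_balanced(formula):
        problems.append("unbalanced_braces")
    if any(p.search(formula) for p in OCR_SUBSTITUTIONS):
        problems.append("ocr_substitution")
    if any(name not in KNOWN_MACROS for name in MACRO.findall(formula)):
        problems.append("unknown_or_truncated_macro")
    return problems
```

A record whose formulas all return an empty list would go to the clean output; any non-empty result would send it to the flagged queue along with the names of the checks that failed.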

3.6 Heuristic fast/slow routing

Even with selective rasterization, processing time remained dominated by pages sent to Tesseract. A routing gate was added before the extraction step to classify each page into a fast lane (native extraction only) or slow lane (rasterization, OCR, and verification).

The gate used two signals: the character count from PyMuPDF's direct extraction (below 10 characters per page, the page was treated as image-dominant) and the ratio of image object area to total page area, computed from the PDF's embedded image metadata. Pages that met the character threshold and stayed under the image-area threshold took the fast lane. Pages that failed either test took the slow lane.

This limited Tesseract calls to the pages that required them, cutting median per-document processing time from approximately eight minutes to approximately ninety seconds.
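
A sketch of the two-signal gate, assuming the fast lane requires a character count at or above its threshold and an image-area ratio at or below its threshold. The image-area ceiling of 0.5 is a hypothetical value; the note states only the 10-character threshold. Bounding boxes are passed in as plain tuples so the logic can be read without any PDF library.

```python
from typing import Iterable, Tuple

BBox = Tuple[float, float, float, float]  # (x0, y0, x1, y1) in page units

MIN_NATIVE_CHARS = 10  # stated in the text above
MAX_IMAGE_RATIO = 0.5  # hypothetical; the actual threshold is not given


def image_area_ratio(image_boxes: Iterable[BBox], page_width: float,
                     page_height: float) -> float:
    """Summed image area over page area, capped at 1.0.

    Overlapping images are counted twice, which is why the result is capped.
    """
    page_area = page_width * page_height
    if page_area <= 0:
        return 0.0
    covered = sum(max(0.0, x1 - x0) * max(0.0, y1 - y0)
                  for x0, y0, x1, y1 in image_boxes)
    return min(1.0, covered / page_area)


def route(char_count: int, image_ratio: float) -> str:
    """Return "fast" only when the page has enough native text and few images."""
    if char_count >= MIN_NATIVE_CHARS and image_ratio <= MAX_IMAGE_RATIO:
        return "fast"
    return "slow"
```

With PyMuPDF, the boxes could plausibly be taken from the "bbox" entries of page.get_image_info() and the page size from page.rect. Under the assumed ceiling, route(340, 0.31) returns "fast", which is the behavior described for the grade 10 page in Section 6.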

4. Pipeline Summary

The diagram below shows the six-stage pipeline as it existed at the end of the experiment.

Stage 1  File enumeration & error isolation
         per-document exception boundary - log error class and path, continue
Stage 2  Batch grouping
         5-document batches, serialize & gc.collect() between
Stage 3  Page classification gate
         char density + image-area ratio → fast or slow lane
Route    Fast lane: PyMuPDF native extraction
         character stream read on text-layer pages
         Slow lane: Rasterize → Tesseract OCR → Gemini verification
         300 DPI bitmap, vie language pack, flash-tier structural review
Stage 4  LaTeX integrity gate
         brace balance, OCR substitution patterns, truncated macros
Stage 5  Structured record output
         Clean records to JSONL; flagged records to review queue

Legend: Fast lane (native text extraction); Slow lane (rasterization + OCR + verification)

Six-stage pipeline as of December 25, 2025. The routing gate at Stage 3 is the primary latency reduction mechanism: most textbook pages contain extractable text layers and bypass Tesseract entirely.

5. Benchmarks

The table below compares the three main configurations tested during the experiment. Each was run on the same hardware against the same document sample.

| Metric (what was measured) | Full rasterization | Hybrid fast/slow routing | Native extraction only |
| Median per-document time (single document, ~100 pages) | ~8 min | ~90 sec | ~12 sec |
| Image-dominant page coverage (pages with no native text layer) | Full | Full | None (silent miss) |
| LaTeX extraction reliability (formulas structurally intact post-extraction) | Low (no gate) | High (gated) | High on native; zero on image pages |
| Memory stability (unattended overnight runs) | Crashes at scale | Stable (batched gc) | Stable |
| AI verification benefit (structural anomaly detection) | N/A | Effective on moderate noise; degrades on heavy artifacts | N/A |

Pipeline configuration comparison. The hybrid fast/slow routing column is the final configuration. "Native extraction only" reflects a hypothetical clean-source baseline included for reference; real corpus documents contain image-heavy pages that native extraction silently misses.

6. A Grade 10 Page That Looked Fully Extracted

During extraction of a grade 10 algebra document, the pipeline processed a page containing a worked example for solving the quadratic equation x² - 5x + 6 = 0. The page had two content zones: a native text layer carrying the problem statement and algebraic solution steps, and a raster image carrying a labeled number line showing the solution interval [2, 3].

The page returned 340 characters from native extraction, placing it above the 10-character fast-lane threshold. The image-area ratio was 0.31 - the diagram occupied roughly a third of the page. Because the character density signal was high, the routing gate sent the page to the fast lane. The algebraic content extracted correctly, including the step-by-step factoring and root identification. The number line diagram produced no output.

The resulting record contained a complete algebraic worked example with no geometric component. It passed the LaTeX integrity gate, because the algebraic content was syntactically valid. No pipeline stage flagged the record as incomplete. Without ground-truth comparison, this record is indistinguishable from one that was fully extracted.

This case illustrates the primary failure mode of the routing heuristic: pages where high character density and significant image area co-occur are likely to be sent to the fast lane, with image content silently dropped.

7. Stable and Fast: Quality Still Depends on the Source

The hybrid configuration is viable for unattended processing on standard hardware. Memory remains bounded because each batch is flushed explicitly before the next loads. Per-document time is under two minutes on documents of typical length, which indicates the pipeline can handle a large corpus in reasonable wall-clock time without GPU acceleration.

These results cover pipeline stability only. The pipeline completed, wrote records, and did not crash during overnight runs. Whether those records accurately represent the source content depends on source document quality and the routing gate's classification accuracy, neither of which was formally evaluated in this experiment.

8. Three Failure Modes That Remain Open

8.1 Verification cannot fix bad scans

The AI verification step was designed to detect structural anomalies in OCR output - merged paragraphs, formulas broken across lines - without rewriting content. When Tesseract processed pages with heavy blur, binding shadow, or significant page skew, the output was too degraded for the model to identify any structural pattern. It returned only low-confidence flags, with no specific anomalies identified. The verification step provides useful signal on moderate noise but cannot recover content from poor-quality source scans. These pages likely need deskewing and contrast normalization before OCR runs.

8.2 The routing gate misclassifies edge cases

The routing thresholds are heuristics set by inspection during this experiment, with no calibration against labeled data. A page with a single short problem statement above a large diagram may fall below the character threshold and be sent to the slow lane unnecessarily. A page with dense native text and a diagram at moderate area - as in the Section 6 example - may be sent to the fast lane with image content silently missed. The gate's error rate on the full corpus is not known. Calibrating the thresholds would require a labeled sample of pages with known extraction ground truth.
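
Once a labeled sample exists, the gate's two error rates could be measured along these lines. This is a sketch under assumptions: the LabeledPage fields are hypothetical, the sample values are invented purely to exercise the function, and the fast-lane rule (enough characters and an image-area ratio at or below a ceiling) is one reading of the gate described in Section 3.6.

```python
from dataclasses import dataclass


@dataclass
class LabeledPage:
    """One hand-labeled page. All field names are illustrative."""
    char_count: int
    image_ratio: float
    needs_slow_lane: bool  # ground truth: page has content only OCR can reach


def gate_error_rates(pages: list[LabeledPage], min_chars: int,
                     max_image_ratio: float) -> dict[str, float]:
    """Silent-miss and unnecessary-OCR rates for one pair of thresholds."""
    missed = wasted = need_slow = need_fast = 0
    for page in pages:
        sent_fast = (page.char_count >= min_chars
                     and page.image_ratio <= max_image_ratio)
        if page.needs_slow_lane:
            need_slow += 1
            missed += sent_fast      # silent miss: image content dropped
        else:
            need_fast += 1
            wasted += not sent_fast  # unnecessary Tesseract call
    return {
        "silent_miss_rate": missed / need_slow if need_slow else 0.0,
        "unnecessary_ocr_rate": wasted / need_fast if need_fast else 0.0,
    }


# Invented sample, for illustration only; the first row mimics the Section 6 page.
sample = [
    LabeledPage(340, 0.31, True),
    LabeledPage(2, 0.90, True),
    LabeledPage(800, 0.00, False),
    LabeledPage(600, 0.05, False),
]
loose = gate_error_rates(sample, min_chars=10, max_image_ratio=0.5)
strict = gate_error_rates(sample, min_chars=10, max_image_ratio=0.1)
```

Running the function over a grid of threshold pairs would show the trade-off between silent misses and wasted OCR time, which is the information needed to choose thresholds deliberately.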

8.3 The LaTeX gate checks syntax, not mathematical meaning

The integrity gate reliably caught syntactically broken expressions: unclosed braces, truncated macro names, and known OCR substitution patterns. A formula with correct brace balance and intact macro names passes every check even when its mathematical content is wrong: a fraction with numerator and denominator transposed by OCR clears all three heuristics and enters the clean output as an incorrect record. Detecting this class of error requires a math-aware parser or a model capable of interpreting mathematical meaning, neither of which was in scope here.
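
A short illustration of the gap, using a brace-balance check written for this example (not the pipeline's own code). The two formulas are invented: one stands for what a page says, the other for an extraction with the operands swapped.

```python
def braces_balanced(formula: str) -> bool:
    depth = 0
    for ch in formula:
        depth += (ch == "{") - (ch == "}")
        if depth < 0:  # closing brace with nothing open
            return False
    return depth == 0


source = r"\frac{x - 2}{x - 3}"   # what the page says (invented example)
ocr_out = r"\frac{x - 3}{x - 2}"  # operands swapped during extraction

# Both strings are well-formed, so a syntax-only check cannot tell them apart.
both_pass = braces_balanced(source) and braces_balanced(ocr_out)
differ = source != ocr_out
```

Both values come out True: the swapped fraction is syntactically clean yet differs from the source, so it would enter the clean output unflagged.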

9. What This Experiment Did Not Measure

The test corpus was a convenience sample of available PDFs, with no stratification by grade level, publisher, or page-type distribution. Performance figures may not generalize to the full textbook corpus.

Routing thresholds were set by inspection, with no optimization against a labeled set. The false-positive and false-negative rates of the page classification gate on the full corpus are not known.

Processing time figures reflect a single hardware configuration - Intel Core processor, 16 GB RAM, no GPU. They will vary with document complexity and the proportion of slow-lane pages in the corpus.

The flagged-record review queue was not evaluated during this experiment. The fraction of flagged records that are recoverable versus unrecoverable from source degradation is unknown.

No semantic validation of mathematical content was performed. The clean output may contain records with valid LaTeX syntax that encode mathematically incorrect content.

10. Two Prerequisites Before a Production Run

Before a production pipeline run, two things are required. First, the routing gate thresholds need calibration against a manually labeled sample of 100 to 200 pages with known ground-truth extraction - one that includes pages where high character count and moderate image area co-occur, as in the example above. Without this, the gate's error rate on content-rich mixed pages remains uncharacterized. Second, pages routed to the slow lane that fail Tesseract quality checks need image preprocessing - deskewing and contrast normalization - applied before OCR runs. Without this step, the verification stage receives input too degraded to analyze and returns only low-confidence flags.


Footnotes

1    PyMuPDF (fitz) version 1.23.x was used for native text extraction. Tesseract version 5.3.x with the vie trained data for Vietnamese was used for OCR on image-dominant pages. Gemini Flash (December 2025 version) was the verification model.

2    The LaTeX validation heuristics documented in Section 3.5 are specific to Tesseract's observed failure modes on Vietnamese math content. They are not a general LaTeX validator. A comprehensive validator would require a full TeX parser.

3    Performance figures in Section 5 are approximate measurements from timed test runs. Hardware configuration: Intel Core processor, 16 GB RAM, no GPU acceleration for OCR workloads. Exact figures will vary with document complexity and page-type distribution.

December 25, 2025 - End of Stage 2. Stage 3 begins production pipeline build.