🤖 AI Summary
Existing OCR systems struggle to reconstruct structurally intact, compilable LaTeX from scientific PDFs, often failing to preserve document invariants such as consistent section structure, correct float placement, and valid label-reference links. This work studies page-level reconstruction of scientific PDFs into compilable LaTeX. The authors construct a large-scale training corpus, TexOCR-Train, and a multi-dimensional evaluation benchmark, TexOCR-Bench, and train a 2B-parameter model, TexOCR, using supervised fine-tuning (SFT) and reinforcement learning (RL). A key component is a verifiable reward derived from LaTeX unit tests that directly enforce compilability and referential integrity. An evaluation of 21 frontier models on TexOCR-Bench shows that existing systems frequently violate these invariants, and that RL with verifiable rewards yields consistent improvements over SFT alone, particularly on structural and compilation metrics.
📝 Abstract
Existing document OCR largely targets plain text or Markdown, discarding the structural and executable properties that make LaTeX essential for scientific publishing. We study page-level reconstruction of scientific PDFs into compilable LaTeX and introduce TexOCR-Bench, a benchmark, and TexOCR-Train, a large-scale training corpus, for this task. TexOCR-Bench features a multi-dimensional evaluation suite that jointly assesses transcription fidelity, structural faithfulness, and end-to-end compilability. Leveraging TexOCR-Train, we train a 2B-parameter model, TexOCR, using supervised fine-tuning (SFT) and reinforcement learning (RL) with verifiable rewards derived from LaTeX unit tests that directly enforce compilability and referential integrity. Experiments across 21 frontier models on TexOCR-Bench show that existing systems frequently violate key document invariants, including consistent section structure, correct float placement, and valid label-reference links, which undermines compilation reliability and downstream usability. Our analysis further reveals that RL with verifiable rewards yields consistent improvements over SFT alone, particularly on structural and compilation metrics.
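To make the idea of a verifiable reward from LaTeX unit tests concrete, the following is a minimal sketch of two static checks, one for label-reference integrity and one for balanced environments. It is an illustration under stated assumptions, not the reward used to train TexOCR: the function names (`dangling_refs`, `environments_balanced`, `structural_reward`), the set of reference commands, and the equal weighting of the two checks are all hypothetical, and the sketch does not invoke a LaTeX compiler or strip comments.

```python
import re

# Hypothetical patterns; the paper's actual unit tests are not specified here.
LABEL_RE = re.compile(r"\\label\{([^}]*)\}")
REF_RE = re.compile(r"\\(?:ref|eqref|autoref|cref|Cref|pageref)\{([^}]*)\}")
ENV_RE = re.compile(r"\\(begin|end)\{([^}]*)\}")


def dangling_refs(tex: str) -> set[str]:
    """Return reference keys that have no matching \\label in the same source."""
    labels = set(LABEL_RE.findall(tex))
    refs = set()
    for group in REF_RE.findall(tex):
        # Commands such as \cref accept comma-separated keys.
        for key in group.split(","):
            refs.add(key.strip())
    return refs - labels


def environments_balanced(tex: str) -> bool:
    """Check that every \\begin{env} is closed by a matching \\end{env}, properly nested."""
    stack = []
    for kind, name in ENV_RE.findall(tex):
        if kind == "begin":
            stack.append(name)
        elif not stack or stack.pop() != name:
            return False
    return not stack


def structural_reward(tex: str) -> float:
    """Fraction of checks passed (assumed equal weighting, for illustration only)."""
    checks = [not dangling_refs(tex), environments_balanced(tex)]
    return sum(checks) / len(checks)


good = r"\begin{figure}\label{fig:a}\end{figure} See \ref{fig:a}."
bad = r"\begin{figure}\label{fig:a} See \ref{fig:b}."
print(structural_reward(good), structural_reward(bad))
```

A compilability test would additionally need to run a LaTeX engine on the output. For page-level reconstruction, a reference may legitimately point to a label on another page, so a real check would need some policy for cross-page references, which this sketch does not attempt.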