🤖 AI Summary
This paper addresses core challenges in OCR conversion of digital printed documents (e.g., PDFs), including mathematical formula recognition, table parsing, and multi-column layout reconstruction. Methodologically, it proposes a reinforcement learning framework grounded in verifiable rewards: (1) a fine-grained reward mechanism guided by binary unit tests to enforce structural and semantic correctness; (2) a controllable synthetic document generation pipeline that produces large-scale, diverse training and evaluation samples covering complex layouts; and (3) a dedicated 7B-parameter vision-language model, olmOCR-2-7B-1025, trained end-to-end. Evaluated on the olmOCR-Bench English benchmark, the method achieves state-of-the-art performance, significantly outperforming prior approaches on three critical tasks—mathematical formula conversion, table parsing, and multi-column ordering—thereby validating the effectiveness of the “verifiable-reward + synthetic-test-driven” paradigm.
📝 Abstract
We present olmOCR 2, the latest in our family of powerful OCR systems for converting digitized print documents, like PDFs, into clean, naturally ordered plain text. olmOCR 2 is powered by olmOCR-2-7B-1025, a specialized 7B vision-language model (VLM) trained using reinforcement learning with verifiable rewards (RLVR), where our rewards are a diverse set of binary unit tests. To scale unit test creation, we develop a pipeline for generating synthetic documents with diverse and challenging layouts, known ground-truth HTML source code, and extracted test cases. We show that RL training on these test cases results in state-of-the-art performance on olmOCR-Bench, our English-language OCR benchmark, with the largest improvements in math formula conversion, table parsing, and multi-column layouts compared to previous versions. We release our model, data, and code under permissive open licenses.