Multimodal OCR: Parse Anything from Documents

📅 2026-03-13
🤖 AI Summary
Traditional OCR systems focus solely on text recognition and struggle to interpret graphical elements such as charts and tables, resulting in significant loss of semantic information in document understanding. This work proposes dots.mocr, a novel approach that treats graphical components as first-class parsing targets alongside text, enabling unified modeling and end-to-end generation of structured textual representations for multimodal documents. Leveraging a large-scale data engine built from PDFs, web pages, and SVGs, the method employs staged pretraining followed by supervised fine-tuning to train a 3-billion-parameter model. Evaluated on the olmOCR Bench, dots.mocr achieves a new state-of-the-art score of 83.9 and ranks second only to Gemini 3 Pro in the OCR Arena, while notably surpassing Gemini 3 Pro in the quality of generated SVG outputs from graphical content.

📝 Abstract
We present Multimodal OCR (MOCR), a document parsing paradigm that jointly parses text and graphics into unified textual representations. Unlike conventional OCR systems that focus on text recognition and leave graphical regions as cropped pixels, our method, termed dots.mocr, treats visual elements such as charts, diagrams, tables, and icons as first-class parsing targets, enabling systems to parse documents while preserving semantic relationships across elements. It offers several advantages: (1) it reconstructs both text and graphics as structured outputs, enabling more faithful document reconstruction; (2) it supports end-to-end training over heterogeneous document elements, allowing models to exploit semantic relations between textual and visual components; and (3) it converts previously discarded graphics into reusable code-level supervision, unlocking multimodal supervision embedded in existing documents. To make this paradigm practical at scale, we build a comprehensive data engine from PDFs, rendered webpages, and native SVG assets, and train a compact 3B-parameter model through staged pretraining and supervised fine-tuning. We evaluate dots.mocr from two perspectives: document parsing and structured graphics parsing. On document parsing benchmarks, it ranks second only to Gemini 3 Pro on our OCR Arena Elo leaderboard, surpasses existing open-source document parsing systems, and sets a new state of the art of 83.9 on olmOCR Bench. On structured graphics parsing, dots.mocr achieves higher reconstruction quality than Gemini 3 Pro across image-to-SVG benchmarks, demonstrating strong performance on charts, UI layouts, scientific figures, and chemical diagrams. These results show a scalable path toward building large-scale image-to-code corpora for multimodal pretraining. Code and models are publicly available at https://github.com/rednote-hilab/dots.mocr.
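To make the idea of a "unified textual representation" concrete, here is a minimal sketch of a page whose text and graphics are both stored as text, with the graphic held as SVG markup. The `Element` class, its `kind` labels, and the `serialize` helper are hypothetical names chosen for illustration; they are not the output schema or API of dots.mocr, which the abstract does not specify.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass

SVG_NS = "http://www.w3.org/2000/svg"


@dataclass
class Element:
    """One parsed page element (hypothetical schema, not the paper's)."""
    kind: str     # "text" or "graphic" in this sketch
    content: str  # plain text, or SVG markup for a graphic


def is_well_formed_svg(markup: str) -> bool:
    """Return True if the markup parses as XML and its root is an <svg> element."""
    try:
        root = ET.fromstring(markup)
    except ET.ParseError:
        return False
    return root.tag in ("svg", f"{{{SVG_NS}}}svg")


def serialize(elements: list[Element]) -> str:
    """Join all elements, in reading order, into one text document."""
    parts = []
    for el in elements:
        if el.kind == "graphic" and not is_well_formed_svg(el.content):
            raise ValueError("graphic element is not well-formed SVG")
        parts.append(el.content.strip())
    return "\n\n".join(parts)


# A toy page: one sentence followed by a two-bar chart expressed as SVG.
page = [
    Element("text", "Figure 1 shows quarterly revenue."),
    Element(
        "graphic",
        '<svg xmlns="http://www.w3.org/2000/svg" width="120" height="60">'
        '<rect x="10" y="30" width="20" height="30"/>'
        '<rect x="40" y="10" width="20" height="50"/></svg>',
    ),
]
document = serialize(page)
```

Because the chart is kept as code rather than as a cropped image, the same string can be re-rendered, edited, or used as an image-to-code training target, which is the property the abstract highlights.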
Problem

Research questions and friction points this paper is trying to address.

Multimodal OCR
document parsing
graphics parsing
semantic reconstruction
structured output
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multimodal OCR
structured document parsing
graphics-to-code
end-to-end multimodal training
SVG reconstruction
Handong Zheng
Unknown affiliation
Yumeng Li
hi lab, Xiaohongshu Inc
Kaile Zhang
The Hong Kong Polytechnic University
psycholinguistics, neurolinguistics, phonetics, speech perception
Liang Xin
Nanyang Technological University
Deep Learning
Guangwei Zhao
hi lab, Xiaohongshu Inc
Hao Liu
hi lab, Xiaohongshu Inc
Jiayu Chen
hi lab, Xiaohongshu Inc
Jie Lou
Xiaohongshu
Alignment, RLHF
Jiyu Qiu
hi lab, Xiaohongshu Inc
Qi Fu
hi lab, Xiaohongshu Inc
Rui Yang
hi lab, Xiaohongshu Inc
Shuo Jiang
hi lab, Xiaohongshu Inc
Weijian Luo
Peking University
Human-preferred Generative Models, Large Vision-language Models
Weijie Su
Associate Professor, University of Pennsylvania
Machine Learning, Differential Privacy, High-Dimensional Statistics, Optimization, Deep Learning
Weijun Zhang
hi lab, Xiaohongshu Inc
Xingyu Zhu
Princeton University
Yabin Li
hi lab, Xiaohongshu Inc
Yiwei Ma
Stevens Institute of Technology
Yu Chen
hi lab, Xiaohongshu Inc
Zhaohui Yu
hi lab, Xiaohongshu Inc
Guang Yang
hi lab, Xiaohongshu Inc
Colin Zhang
hi lab, Xiaohongshu Inc
Lei Zhang
hi lab, Xiaohongshu Inc
Yuliang Liu
Huazhong University of Science and Technology
Xiang Bai
Huazhong University of Science and Technology (HUST)
Computer Vision, OCR