OCRVerse: Towards Holistic OCR in End-to-End Vision-Language Models

📅 2026-01-29
📈 Citations: 1
✨ Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the limitations of existing OCR systems, which focus predominantly on text-centric tasks and struggle to process multimodal content in visually dense images such as charts and web pages. To bridge this gap, we propose OCRVerse, the first end-to-end, full-stack OCR framework that unifies modeling of both text-centric and vision-centric tasks. We construct a large-scale dataset spanning diverse document types and complex visual layouts, and introduce a two-stage training strategy of supervised fine-tuning (SFT) followed by reinforcement learning (RL), augmented with a domain-adaptive reward mechanism that mitigates inter-domain data conflicts and output-format discrepancies. Experimental results demonstrate that OCRVerse achieves performance on par with current state-of-the-art open- and closed-source large models across both task categories.
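The report describes the domain-adaptive reward only at this level of detail, so the routing below is a minimal sketch under assumptions, not OCRVerse's implementation: a character-similarity reward stands in for text-centric pages, a set-overlap reward for structured chart output, and every name here (text_reward, chart_reward, domain_adaptive_reward) is hypothetical.

```python
# Hypothetical sketch of a domain-adaptive reward router; the paper's
# actual reward functions are not public.
from difflib import SequenceMatcher

def text_reward(pred: str, ref: str) -> float:
    # Text-centric OCR: reward character-level agreement with the
    # reference transcript (a normalized similarity score stands in
    # for 1 - edit distance).
    return SequenceMatcher(None, pred, ref).ratio()

def chart_reward(pred: str, ref: str) -> float:
    # Vision-centric OCR: compare structured output (e.g., a chart
    # serialized as one record per line) item by item, ignoring order.
    pred_items = {line.strip() for line in pred.splitlines() if line.strip()}
    ref_items = {line.strip() for line in ref.splitlines() if line.strip()}
    if not ref_items:
        return 0.0
    return len(pred_items & ref_items) / len(ref_items)

# Map each domain to the reward that matches its expected output format.
REWARDS = {
    "document": text_reward,   # newspapers, magazines, books
    "chart": chart_reward,     # charts, web pages, scientific plots
}

def domain_adaptive_reward(domain: str, pred: str, ref: str) -> float:
    # Route each RL sample to its domain's reward so that format
    # differences between domains do not conflict in one signal.
    return REWARDS[domain](pred, ref)
```

One consequence of this design is that adding a new domain only requires registering one more reward function; the RL loop itself stays unchanged.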

📝 Abstract
The development of large vision-language models drives the demand for managing and applying massive amounts of multimodal data, making OCR technology, which extracts information from visual images, increasingly popular. However, existing OCR methods primarily focus on recognizing text elements from images or scanned documents (text-centric OCR), neglecting the identification of visual elements from visually information-dense image sources (vision-centric OCR), such as charts, web pages, and scientific plots. In reality, these visually information-dense images are widespread on the internet and have significant real-world application value, for example in data visualization and web-page analysis. In this technical report, we propose OCRVerse, the first holistic OCR method that unifies text-centric and vision-centric OCR in an end-to-end manner. To this end, we construct a comprehensive data-engineering pipeline covering a wide range of text-centric documents, such as newspapers, magazines, and books, as well as vision-centric rendered composites, including charts, web pages, and scientific plots. Moreover, we propose a two-stage SFT-RL multi-domain training method for OCRVerse: SFT directly mixes cross-domain data to establish initial domain knowledge, while RL designs personalized reward strategies for the characteristics of each domain. Specifically, since different domains require different output formats and expected outputs, the RL stage provides sufficient flexibility to customize reward signals for each domain, thereby improving cross-domain fusion and avoiding data conflicts. Experimental results demonstrate the effectiveness of OCRVerse, which achieves competitive results across text-centric and vision-centric data types, comparable even to large-scale open-source and closed-source models.
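The abstract stops at this level of detail, so the skeleton below is only a plausible reading of the two-stage SFT-RL schedule, assuming a policy-gradient RL update; StubModel and every method on it are invented stand-ins, not the OCRVerse codebase.

```python
# Minimal runnable sketch of a two-stage SFT -> RL schedule as the
# abstract describes it. All classes and hooks here are placeholders.
import random

class StubModel:
    """Stand-in for a vision-language model exposing training hooks."""
    def sft_step(self, batch):           # supervised next-token update
        pass
    def generate(self, image):           # rollout used during RL
        return ""
    def rl_step(self, rollout, reward):  # policy update, e.g. GRPO/PPO
        pass

def train(model, domains, reward_fns, sft_steps=1000, rl_steps=500):
    # Stage 1 (SFT): directly mix cross-domain data into one stream
    # to establish initial knowledge of every domain.
    mixed = [ex for data in domains.values() for ex in data]
    for _ in range(sft_steps):
        model.sft_step(random.sample(mixed, k=min(8, len(mixed))))

    # Stage 2 (RL): sample per domain and score each rollout with that
    # domain's customized reward, avoiding cross-domain format conflicts.
    for _ in range(rl_steps):
        domain = random.choice(list(domains))
        ex = random.choice(domains[domain])
        rollout = model.generate(ex["image"])
        model.rl_step(rollout, reward_fns[domain](rollout, ex["target"]))
    return model
```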
Problem

Research questions and friction points this paper is trying to address.

OCR
vision-language models
text-centric OCR
vision-centric OCR
multimodal data
Innovation

Methods, ideas, or system contributions that make the work stand out.

holistic OCR
vision-language models
multi-domain training
reinforcement learning
text-centric and vision-centric OCR
👥 Authors
Yufeng Zhong (Meituan) · Multimodal LLM, Computer Vision
Lei Chen (Meituan) · MLLM, Computer Vision
Xuanle Zhao (Meituan)
Wenkang Han (Zhejiang University) · Vision-Language Model, Agentic Intelligence
Liming Zheng (Meituan)
Jing Huang (Meituan)
Deyang Jiang (Meituan)
Yilin Cao (Meituan)
Lin Ma (Meituan) · Multimodal LLM, Computer Vision
Zhixiong Zeng (Meituan)