🤖 AI Summary
Existing vision-language models (VLMs) suffer from excessive parameter counts and high computational overhead, hindering efficient multi-granularity joint recognition of text and mathematical formulas in documents. To address this, we propose UniRec, a lightweight unified recognition model with only 0.1B parameters. Our method introduces a hierarchical supervision training paradigm and a semantic-decoupled tokenizer to explicitly disentangle textual and formulaic semantics while capturing structural variability. We construct UniRec40M, a large-scale, 40-million-sample hybrid dataset spanning diverse document domains. UniRec employs a lightweight vision-language modeling framework coupled with a multi-granularity sequence recognition architecture. On bilingual (Chinese/English), multi-domain document understanding benchmarks, it consistently outperforms state-of-the-art VLMs and domain-specific parsing models, achieving 2–9× faster inference speed. This advancement significantly enhances the practicality of on-device document understanding.
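To make the "hierarchical supervision training paradigm" concrete, the toy sketch below penalizes predictions at several granularity levels (e.g. character, word, line) and combines them into one weighted loss. This is only an illustration of the general idea; the level set, weighting scheme, and loss functions here are assumptions, not UniRec's actual training configuration.

```python
# Illustrative sketch of hierarchical supervision: the model is supervised at
# multiple granularities, and the total loss is a weighted sum of per-level
# losses. Levels and weights below are hypothetical, not taken from the paper.
import math

def cross_entropy(probs, target):
    """Negative log-likelihood of the target class under a probability vector."""
    return -math.log(probs[target])

def hierarchical_loss(level_outputs, level_targets, weights):
    """Weighted sum of mean per-level cross-entropy losses.

    level_outputs: one list of probability vectors per granularity level
    level_targets: one list of target indices per granularity level
    weights:       one scalar weight per granularity level
    """
    total = 0.0
    for (probs_seq, tgt_seq), w in zip(zip(level_outputs, level_targets), weights):
        level = sum(cross_entropy(p, t) for p, t in zip(probs_seq, tgt_seq)) / len(tgt_seq)
        total += w * level
    return total
```

With perfect predictions at every level the loss is zero; a uniform prediction over two classes contributes log 2 per position, scaled by that level's weight.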
📄 Abstract
Text and formulas constitute the core informational components of many documents. Accurately and efficiently recognizing both is crucial for developing robust and generalizable document parsing systems. Recently, vision-language models (VLMs) have achieved impressive unified recognition of text and formulas. However, they are large and computationally demanding, restricting their use in many applications. In this paper, we propose UniRec-0.1B, a unified recognition model with only 0.1B parameters. It performs text and formula recognition at multiple levels, including characters, words, lines, paragraphs, and documents. To support this task, we first establish UniRec40M, a large-scale dataset comprising 40 million text, formula, and mixed samples, enabling the training of a powerful yet lightweight model. Second, we identify two challenges in building such a lightweight yet unified expert model: structural variability across hierarchies and semantic entanglement between textual and formulaic content. To tackle these, we introduce hierarchical supervision training that explicitly guides structural comprehension, and a semantic-decoupled tokenizer that separates text and formula representations. Finally, we develop a comprehensive evaluation benchmark covering Chinese and English documents from multiple domains and at multiple granularity levels. Experimental results on this and public benchmarks demonstrate that UniRec-0.1B outperforms both general-purpose VLMs and leading document parsing expert models, while achieving a 2–9$\times$ speedup, validating its effectiveness and efficiency. Codebase and Dataset: https://github.com/Topdu/OpenOCR.
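The abstract's "semantic-decoupled tokenizer that separates text and formula representations" can be pictured as routing the two content types into disjoint vocabularies. The sketch below is a minimal, hypothetical illustration: it treats `$...$` spans as formulas and gives formula tokens an ID range disjoint from text tokens. The vocabularies, delimiter convention, and matching strategy are assumptions for illustration, not UniRec's actual tokenizer.

```python
# Hypothetical sketch of a semantic-decoupled tokenizer: text and formula
# content are encoded with separate vocabularies whose ID ranges never
# overlap, so the decoder can tell the two modalities apart by token ID.
import re

TEXT_VOCAB = {ch: i for i, ch in enumerate("abcdefghijklmnopqrstuvwxyz 0123456789")}
FORMULA_VOCAB = {tok: i for i, tok in enumerate(
    ["\\frac", "\\sum", "^", "_", "{", "}", "x", "y", "n", "2", "+", "="])}
FORMULA_OFFSET = len(TEXT_VOCAB)  # formula IDs start above all text IDs

def tokenize(s: str) -> list[int]:
    """Split input on $...$ spans; encode each span with its own vocabulary."""
    ids = []
    for i, span in enumerate(re.split(r"\$(.*?)\$", s)):
        if i % 2 == 0:  # even chunks are plain text: character-level encoding
            ids += [TEXT_VOCAB[c] for c in span.lower() if c in TEXT_VOCAB]
        else:  # odd chunks are formulas: greedy longest-match over LaTeX tokens
            pos = 0
            while pos < len(span):
                for tok in sorted(FORMULA_VOCAB, key=len, reverse=True):
                    if span.startswith(tok, pos):
                        ids.append(FORMULA_OFFSET + FORMULA_VOCAB[tok])
                        pos += len(tok)
                        break
                else:
                    pos += 1  # skip characters outside the formula vocabulary
    return ids
```

For example, `tokenize("sum $x^2$")` produces four text-range IDs for `"sum "` followed by three formula-range IDs for `x`, `^`, `2`, so downstream components can separate the two streams by a simple ID threshold.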