🤖 AI Summary
Existing vision-language models (VLMs) suffer from excessive parameter counts and high computational overhead, hindering efficient multi-granularity joint recognition of text and mathematical formulas in documents. To address this, we propose UniRec, a lightweight unified recognition model with only 0.1B parameters. Our method introduces a hierarchical supervision training paradigm and a semantic-decoupled tokenizer to explicitly disentangle textual and formulaic semantics while capturing structural variability. We construct UniRec40M, a large-scale, 40-million-sample hybrid dataset spanning diverse document domains. UniRec employs a lightweight vision-language modeling framework coupled with a multi-granularity sequence recognition architecture. On bilingual (Chinese/English), multi-domain document understanding benchmarks, it consistently outperforms state-of-the-art VLMs and domain-specific parsing models, achieving 2–9× faster inference speed. This advancement significantly enhances the practicality of on-device document understanding.
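To make the "hierarchical supervision training paradigm" concrete, the toy sketch below penalizes predictions at several granularity levels (e.g. character, word, line) and combines them into one weighted loss. This is only an illustration of the general idea; the level set, weighting scheme, and loss functions here are assumptions, not UniRec's actual training configuration.

```python
# Illustrative sketch of hierarchical supervision: the model is supervised at
# multiple granularities, and the total loss is a weighted sum of per-level
# losses. Levels and weights below are hypothetical, not taken from the paper.
import math

def cross_entropy(probs, target):
    """Negative log-likelihood of the target class under a probability vector."""
    return -math.log(probs[target])

def hierarchical_loss(level_outputs, level_targets, weights):
    """Weighted sum of mean per-level cross-entropy losses.

    level_outputs: one list of probability vectors per granularity level
    level_targets: one list of target indices per granularity level
    weights:       one scalar weight per granularity level
    """
    total = 0.0
    for (probs_seq, tgt_seq), w in zip(zip(level_outputs, level_targets), weights):
        level = sum(cross_entropy(p, t) for p, t in zip(probs_seq, tgt_seq)) / len(tgt_seq)
        total += w * level
    return total
```

With perfect predictions at every level the loss is zero; a uniform prediction over two classes contributes log 2 per position, scaled by that level's weight.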
📄 Abstract
Text and formulas constitute the core informational components of many documents. Accurately and efficiently recognizing both is crucial for developing robust and generalizable document parsing systems. Recently, vision-language models (VLMs) have achieved impressive unified recognition of text and formulas. However, they are large and computationally demanding, restricting their use in many applications. In this paper, we propose UniRec-0.1B, a unified recognition model with only 0.1B parameters. It performs text and formula recognition at multiple levels, including characters, words, lines, paragraphs, and documents. To support this task, we first establish UniRec40M, a large-scale dataset comprising 40 million text, formula, and mixed samples, enabling the training of a powerful yet lightweight model. Second, we identify two challenges in building such a lightweight yet unified expert model: structural variability across hierarchies and semantic entanglement between textual and formulaic content. To tackle these, we introduce hierarchical supervision training that explicitly guides structural comprehension, and a semantic-decoupled tokenizer that separates text and formula representations. Finally, we develop a comprehensive evaluation benchmark covering Chinese and English documents from multiple domains and at multiple granularity levels. Experimental results on this and public benchmarks demonstrate that UniRec-0.1B outperforms both general-purpose VLMs and leading document parsing expert models, while achieving a 2–9$\times$ speedup, validating its effectiveness and efficiency. Codebase and Dataset: https://github.com/Topdu/OpenOCR.
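The abstract's "semantic-decoupled tokenizer that separates text and formula representations" can be pictured as routing the two content types into disjoint vocabularies. The sketch below is a minimal, hypothetical illustration: it treats `$...$` spans as formulas and gives formula tokens an ID range disjoint from text tokens. The vocabularies, delimiter convention, and matching strategy are assumptions for illustration, not UniRec's actual tokenizer.

```python
# Hypothetical sketch of a semantic-decoupled tokenizer: text and formula
# content are encoded with separate vocabularies whose ID ranges never
# overlap, so the decoder can tell the two modalities apart by token ID.
import re

TEXT_VOCAB = {ch: i for i, ch in enumerate("abcdefghijklmnopqrstuvwxyz 0123456789")}
FORMULA_VOCAB = {tok: i for i, tok in enumerate(
    ["\\frac", "\\sum", "^", "_", "{", "}", "x", "y", "n", "2", "+", "="])}
FORMULA_OFFSET = len(TEXT_VOCAB)  # formula IDs start above all text IDs

def tokenize(s: str) -> list[int]:
    """Split input on $...$ spans; encode each span with its own vocabulary."""
    ids = []
    for i, span in enumerate(re.split(r"\$(.*?)\$", s)):
        if i % 2 == 0:  # even chunks are plain text: character-level encoding
            ids += [TEXT_VOCAB[c] for c in span.lower() if c in TEXT_VOCAB]
        else:  # odd chunks are formulas: greedy longest-match over LaTeX tokens
            pos = 0
            while pos < len(span):
                for tok in sorted(FORMULA_VOCAB, key=len, reverse=True):
                    if span.startswith(tok, pos):
                        ids.append(FORMULA_OFFSET + FORMULA_VOCAB[tok])
                        pos += len(tok)
                        break
                else:
                    pos += 1  # skip characters outside the formula vocabulary
    return ids
```

For example, `tokenize("sum $x^2$")` produces four text-range IDs for `"sum "` followed by three formula-range IDs for `x`, `^`, `2`, so downstream components can separate the two streams by a simple ID threshold.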