FireRed-OCR Technical Report

📅 2026-03-02
🤖 AI Summary
This work addresses the structural hallucination that general-purpose vision-language models (VLMs) exhibit in complex document parsing, which keeps them from meeting industrial OCR demands for pixel-level accuracy. The authors propose FireRed-OCR, a framework built on Qwen3-VL that turns a general VLM into a high-precision expert for structured document understanding. The approach combines a "geometry + semantics" data factory with a three-stage progressive training strategy: multi-task pre-alignment, instruction fine-tuning, and format-constrained Group Relative Policy Optimization (GRPO). The data factory uses geometric feature clustering and multi-dimensional labeling to synthesize and balance the training data, and the format-aware reinforcement learning stage enforces strict compliance with output structure and syntax. On OmniDocBench v1.5, FireRed-OCR achieves a state-of-the-art overall score of 92.94%, outperforming DeepSeek-OCR 2 and OCRVerse across text, formula, table, and reading order metrics.

📝 Abstract
We present FireRed-OCR, a systematic framework for specializing general VLMs into high-performance OCR models. Large Vision-Language Models (VLMs) have demonstrated impressive general capabilities but frequently suffer from "structural hallucination" when processing complex documents, which limits their utility in industrial OCR applications. FireRed-OCR is designed to transform general-purpose VLMs (based on Qwen3-VL) into pixel-precise structural document parsing experts. To address the scarcity of high-quality structured data, we construct a "Geometry + Semantics" Data Factory. Unlike traditional random sampling, our pipeline leverages geometric feature clustering and multi-dimensional tagging to synthesize and curate a highly balanced dataset that covers long-tail layouts and rare document types. Furthermore, we propose a Three-Stage Progressive Training strategy that guides the model from pixel-level perception to logical structure generation. This curriculum includes: (1) Multi-task Pre-alignment to ground the model's understanding of document structure; (2) Specialized SFT for standardizing full-image Markdown output; and (3) Format-Constrained Group Relative Policy Optimization (GRPO), which uses reinforcement learning to enforce strict syntactic validity and structural integrity (e.g., table closure, formula syntax). Extensive evaluations on OmniDocBench v1.5 demonstrate that FireRed-OCR achieves state-of-the-art performance with an overall score of 92.94%, outperforming strong baselines such as DeepSeek-OCR 2 and OCRVerse across text, formula, table, and reading order metrics. We open-source our code and model weights to facilitate the "General VLM to Specialized Structural Expert" paradigm.
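The format-constrained GRPO stage can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the three structural checks (HTML table closure, paired `$$` delimiters, balanced braces), and the equal weighting are all assumptions chosen for illustration. The advantage step follows the commonly described GRPO recipe of normalizing each sampled output's reward by the mean and standard deviation of its group; using the population standard deviation here is also an assumption.

```python
from statistics import mean, pstdev


def _braces_balanced(text: str) -> bool:
    """Return True if every '{' has a matching '}' in order."""
    depth = 0
    for ch in text:
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth < 0:
                return False
    return depth == 0


def format_reward(markdown: str) -> float:
    """Hypothetical format reward: fraction of structural checks passed."""
    checks = [
        # Table closure: every opened HTML table is closed (assumes HTML tables).
        markdown.count("<table") == markdown.count("</table>"),
        # Formula syntax proxy 1: display-math delimiters come in pairs.
        markdown.count("$$") % 2 == 0,
        # Formula syntax proxy 2: braces are balanced.
        _braces_balanced(markdown),
    ]
    return sum(checks) / len(checks)


def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Normalize rewards within one group of sampled outputs for the same input."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Two hypothetical candidate parses sampled for the same page.
candidates = [
    "<table><tr><td>$$x^{2}$$</td></tr></table>",  # well-formed
    "<table><tr><td>$$x^{2</td></tr>",             # unclosed table, formula, brace
]
rewards = [format_reward(c) for c in candidates]
advantages = group_relative_advantages(rewards)
```

In this sketch the well-formed candidate receives a positive advantage and the malformed one a negative advantage, so a policy update would push probability toward structurally valid output. A real reward would likely combine such format checks with content accuracy terms, which the abstract does not specify.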
Problem

Research questions and friction points this paper is trying to address.

structural hallucination
OCR
Vision-Language Models
document parsing
industrial OCR
Innovation

Methods, ideas, or system contributions that make the work stand out.

Geometry + Semantics Data Factory
Three-Stage Progressive Training
Format-Constrained GRPO
Structural Document Parsing
VLM Specialization
Authors
Hao Wu
Super Intelligence Team, Xiaohongshu Inc.
Haoran Lou
Super Intelligence Team, Xiaohongshu Inc.
Xinyue Li
Super Intelligence Team, Xiaohongshu Inc.
Zuodong Zhong
Super Intelligence Team, Xiaohongshu Inc.
Zhaojun Sun
Super Intelligence Team, Xiaohongshu Inc.
Phellon Chen
Super Intelligence Team, Xiaohongshu Inc.
Xuanhe Zhou
Assistant Professor, Shanghai Jiao Tong University
Data Management, Artificial Intelligence
Kai Zuo
Super Intelligence Team, Xiaohongshu Inc.
Yibo Chen
Super Intelligence Team, Xiaohongshu Inc.
Xu Tang
Xiaohongshu. Homepage: https://tangxuvis.github.io/
Face Detection, Face Recognition, GAN, Video Understanding, Text Video Retrieval
Yao Hu
Zhejiang University
Machine Learning
Boxiang Zhou
Super Intelligence Team, Xiaohongshu Inc.
Jian Wu
Unknown affiliation
Music Generation
Yongji Wu
UC Berkeley
Machine Learning Systems, Datacenter Networks
Wenxin Yu
Super Intelligence Team, Xiaohongshu Inc.
Yingmiao Liu
Super Intelligence Team, Xiaohongshu Inc.
Yuhao Huang
Shenzhen University
Medical Image Computing, Ultrasound, Model Robustness
Manjie Xu
Peking University
Cognitive Reasoning
Gang Liu
Super Intelligence Team, Xiaohongshu Inc.
Yidong Ma
Super Intelligence Team, Xiaohongshu Inc.
Zhichao Sun
Super Intelligence Team, Xiaohongshu Inc.
Changhao Qiao
Super Intelligence Team, Xiaohongshu Inc.