Qianfan-OCR: A Unified End-to-End Model for Document Intelligence

📅 2026-03-11
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses a limitation of conventional end-to-end OCR models: they lack explicit layout analysis, which hinders accurate parsing of documents with complex layouts. The authors propose Qianfan-OCR, a 4-billion-parameter end-to-end vision-language model with a Layout-as-Thought mechanism, an optional thinking phase triggered by special think tokens that generates structured layout representations (bounding boxes, element types, and reading order) before the final output. This recovers layout grounding while keeping the end-to-end design. The model unifies direct image-to-Markdown conversion with prompt-driven tasks such as table extraction, chart understanding, document QA, and key information extraction. It ranks first among end-to-end models on OmniDocBench v1.5 (93.12) and OlmOCR Bench (79.8), attains the highest average score on public key information extraction benchmarks (surpassing Gemini-3.1-Pro, Seed-2.0, and Qwen3-VL-235B), and is competitive with general VLMs of comparable scale on OCRBench, CCOCR, DocVQA, and ChartQA.

📝 Abstract
We present Qianfan-OCR, a 4B-parameter end-to-end vision-language model that unifies document parsing, layout analysis, and document understanding within a single architecture. It performs direct image-to-Markdown conversion and supports diverse prompt-driven tasks including table extraction, chart understanding, document QA, and key information extraction. To address the loss of explicit layout analysis in end-to-end OCR, we propose Layout-as-Thought, an optional thinking phase triggered by special think tokens that generates structured layout representations (bounding boxes, element types, and reading order) before producing final outputs, recovering layout grounding capabilities while improving accuracy on complex layouts. Qianfan-OCR ranks first among end-to-end models on OmniDocBench v1.5 (93.12) and OlmOCR Bench (79.8), achieves competitive results on OCRBench, CCOCR, DocVQA, and ChartQA against general VLMs of comparable scale, and attains the highest average score on public key information extraction benchmarks, surpassing Gemini-3.1-Pro, Seed-2.0, and Qwen3-VL-235B. The model is publicly accessible via the Baidu AI Cloud Qianfan platform.
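As a rough illustration of the Layout-as-Thought idea described in the abstract, the sketch below shows how a client might separate an optional layout-thinking block from the final Markdown output. The abstract does not specify the think tokens or the serialization of the layout, so everything concrete here is an assumption made for illustration: the `<think>`/`</think>` delimiters, the JSON encoding, the field names `bbox`, `type`, and `order`, the `LayoutElement` class, the `split_response` helper, and the made-up sample response. It is not the model's actual output format.

```python
import json
import re
from dataclasses import dataclass
from typing import List, Optional, Tuple

# Hypothetical delimiters: the abstract mentions "special think tokens"
# but does not name them.
THINK_OPEN = "<think>"
THINK_CLOSE = "</think>"


@dataclass
class LayoutElement:
    bbox: Tuple[int, ...]  # assumed (x1, y1, x2, y2) pixel coordinates
    kind: str              # element type, e.g. "title" or "table"
    order: int             # position in the reading order


def split_response(text: str) -> Tuple[Optional[List[LayoutElement]], str]:
    """Separate an optional layout-thinking block from the final output."""
    pattern = re.escape(THINK_OPEN) + r"(.*?)" + re.escape(THINK_CLOSE)
    match = re.search(pattern, text, flags=re.DOTALL)
    if match is None:
        # Thinking phase not triggered: the whole response is the final output.
        return None, text.strip()
    raw = json.loads(match.group(1))  # assumed: a JSON list of objects
    elements = [
        LayoutElement(tuple(item["bbox"]), item["type"], item["order"])
        for item in raw
    ]
    elements.sort(key=lambda e: e.order)  # arrange by reading order
    return elements, text[match.end():].strip()


# Made-up response used only to exercise the parser.
sample = (
    '<think>[{"bbox": [40, 120, 560, 400], "type": "table", "order": 2},'
    ' {"bbox": [40, 30, 560, 80], "type": "title", "order": 1}]</think>\n'
    "# Quarterly Report\n\n| Item | Value |\n|---|---|\n| A | 1 |"
)
layout, markdown = split_response(sample)
```

Because the thinking phase is optional, the helper returns `None` for the layout when no think block is present, so the same calling code can handle both inference modes.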
Problem

Research questions and friction points this paper is trying to address.

end-to-end OCR
layout analysis
document understanding
structured representation
complex document layout
Innovation

Methods, ideas, or system contributions that make the work stand out.

end-to-end OCR
Layout-as-Thought
vision-language model
document intelligence
structured layout representation
Authors

Daxiang Dong
Baidu
Deep Learning, Natural Language Processing, Data Mining
Mingming Zheng
Dong Xu
Chunhua Luo
Bairong Zhuang
Yuxuan Li
Ruoyun He
Haoran Wang
Wenyu Zhang
Wenbo Wang
Yicheng Wang
Xue Xiong
Ayong Zheng
Xiaoying Zuo
Ziwei Ou
Jingnan Gu
Quanhao Guo
Jianmin Wu
Dawei Yin
Senior Director, Head of Search Science at Baidu
Machine Learning, Web Mining, Data Mining
Dou Shen
Baidu Inc
Data Mining, Machine Learning, Online Advertising