Youtu-Parsing: Perception, Structuring and Recognition via High-Parallelism Decoding

📅 2026-01-28
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the robustness and efficiency bottlenecks in multimodal content extraction—encompassing text, tables, formulas, and figures—within document intelligence, particularly under challenging conditions such as multilingual settings, handwritten inputs, and rare characters. To this end, we propose a decoupled and feature-reusable document parsing framework that integrates a dynamic-resolution Vision Transformer visual encoder with a prompt-guided Youtu-LLM-2B large language model. Our approach introduces a novel dual high-parallel decoding mechanism combining token-level and query-level parallelism, achieving 5–11× and 2× speedups respectively while maintaining output quality. The method attains state-of-the-art performance on both OmniDocBench and olmOCR-bench, significantly enhancing structured document parsing efficiency and cross-domain generalization capability.
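The "decoupled and feature-reusable" framework described above means the visual encoder runs once per page, and its features are then shared by both layout analysis and every region-prompted decode. A minimal sketch of that pipeline shape (all names here are illustrative, not the paper's actual API):

```python
class FeatureReusingParser:
    """Illustrative pipeline shape: the visual encoder runs once per
    page, and its output features are reused for layout analysis and
    for each region-prompted decode, instead of re-encoding per query."""

    def __init__(self, encode, analyze_layout, decode_region):
        self.encode = encode                # e.g. a dynamic-resolution ViT
        self.analyze_layout = analyze_layout
        self.decode_region = decode_region  # e.g. a prompt-guided LLM decode

    def parse(self, page_image):
        feats = self.encode(page_image)        # single encoder pass per page
        regions = self.analyze_layout(feats)   # layout from the shared features
        # Each region decode reuses the cached features rather than the raw image.
        return [self.decode_region(feats, region) for region in regions]
```

The point of the decoupling is visible in the call pattern: one `encode` call serves arbitrarily many region decodes, which is also what makes the query-level parallelism described later cheap to add.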

📝 Abstract
This paper presents Youtu-Parsing, an efficient and versatile document parsing model designed for high-performance content extraction. The architecture employs a native Vision Transformer (ViT) featuring a dynamic-resolution visual encoder to extract shared document features, coupled with a prompt-guided Youtu-LLM-2B language model for layout analysis and region-prompted decoding. Leveraging this decoupled and feature-reusable framework, we introduce a high-parallelism decoding strategy comprising two core components: token parallelism and query parallelism. The token parallelism strategy concurrently generates up to 64 candidate tokens per inference step, which are subsequently validated through a verification mechanism. This approach yields a 5–11× speedup over traditional autoregressive decoding and is particularly well-suited for highly structured scenarios, such as table recognition. To further exploit the advantages of region-prompted decoding, the query parallelism strategy enables simultaneous content prediction for multiple bounding boxes (up to five), providing an additional 2× acceleration while maintaining output quality equivalent to standard decoding. Youtu-Parsing handles a diverse range of document elements, including text, formulas, tables, charts, seals, and hierarchical structures. Furthermore, the model exhibits strong robustness when handling rare characters, multilingual text, and handwritten content. Extensive evaluations demonstrate that Youtu-Parsing achieves state-of-the-art (SOTA) performance on both the OmniDocBench and olmOCR-bench benchmarks. Overall, Youtu-Parsing demonstrates significant experimental value and practical utility for large-scale document intelligence applications.
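The token-parallelism step described in the abstract — propose a block of candidate tokens in one step, then keep what a verification pass confirms — resembles draft-and-verify (speculative-style) decoding. A toy sketch under that assumption, with stub functions standing in for the model (nothing here is the paper's actual API):

```python
def verify_prefix(draft, verified):
    """Keep the longest prefix of the draft block that matches the
    token-by-token verified continuation."""
    accepted = []
    for d, v in zip(draft, verified):
        if d != v:
            break
        accepted.append(d)
    return accepted


def parallel_decode(draft_fn, verify_fn, prompt, max_new=32, block=8):
    """Toy draft-and-verify loop: propose `block` candidate tokens in one
    step, keep the verified prefix, repeat. On a total mismatch it falls
    back to accepting a single verified token, so it degrades to ordinary
    autoregressive decoding rather than stalling."""
    out = list(prompt)
    while len(out) - len(prompt) < max_new:
        draft = draft_fn(out, block)        # block of candidate tokens
        truth = verify_fn(out, len(draft))  # reference continuation
        if not truth:
            break
        accepted = verify_prefix(draft, truth) or truth[:1]
        out.extend(accepted)
        if accepted[-1] == "<eos>":
            break
    return out[len(prompt):]
```

When the draft is reliable (as in repetitive table markup), most of each 64-token block is accepted and one verification step replaces dozens of sequential decode steps, which is where the reported 5–11× speedup on structured content would come from.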
Problem

Research questions and friction points this paper is trying to address.

- document parsing
- content extraction
- structured document understanding
- multilingual recognition
- handwritten text
Innovation

Methods, ideas, or system contributions that make the work stand out.

- high-parallelism decoding
- token parallelism
- query parallelism
- region-prompted decoding
- document parsing
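The query parallelism listed above — decoding the content of several region prompts at once — can be sketched as simple batching of bounding-box queries. This is an illustrative simplification (the real model presumably fuses the queries inside a single forward pass over the shared features):

```python
def chunk_queries(bboxes, max_parallel=5):
    """Group region prompts into batches of at most `max_parallel`
    (the paper reports up to five boxes decoded simultaneously)."""
    return [bboxes[i:i + max_parallel] for i in range(0, len(bboxes), max_parallel)]


def decode_regions(bboxes, batched_decode, max_parallel=5):
    """Run a hypothetical batched decoder over all regions: one decode
    invocation per batch instead of one per box, which is where the
    roughly 2x acceleration over per-box decoding would come from."""
    results = []
    for batch in chunk_queries(bboxes, max_parallel):
        results.extend(batched_decode(batch))  # decodes every box in the batch at once
    return results
```

For a page with 12 detected regions, this issues 3 batched decodes (5 + 5 + 2) instead of 12 sequential ones, while each box still receives its own output.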
🔎 Similar Papers
2024-02-20 · International Conference on Machine Learning · Citations: 30
Kun Yin, Yunfei Wu, Bing Liu, Zhongpeng Cai, Xiaotian Li, Huang Chen, Xin Li, Haoyu Cao, Yinsong Liu, Deqiang Jiang, Xing Sun (Tencent Youtu Lab)
LLM · MLLM · Agent
Yunsheng Wu, Qianyu Li, Antai Guo, Yanzhen Liao, Yanqiu Qu, Haodong Lin, Chengxu He, Shuangyin Liu