Dolphin: Document Image Parsing via Heterogeneous Anchor Prompting

📅 2025-05-20
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Document image parsing faces challenges including complex layouts of mixed-content elements (text, formulas, tables, figures), layout degradation, and computational inefficiency. To address these, we propose a two-stage "analyze-then-parse" paradigm: first generating reading-order layout anchors, then performing parallel decoding of element contents conditioned on heterogeneous anchors, effectively decoupling layout recognition from content generation. Our contributions include: (1) a multi-granularity heterogeneous anchor sequence modeling scheme; (2) task-conditioned prompt injection; and (3) a two-stage self-feedback mechanism. We construct a large-scale training dataset of 30 million samples and adopt a lightweight multimodal Transformer architecture. Our method achieves state-of-the-art performance at both page level and element level on mainstream and in-house benchmarks, with significant gains in structural accuracy and a 2.3× inference speedup. Code and models are publicly released.

📝 Abstract
Document image parsing is challenging due to its complexly intertwined elements such as text paragraphs, figures, formulas, and tables. Current approaches either assemble specialized expert models or directly generate page-level content autoregressively, facing integration overhead, efficiency bottlenecks, and layout structure degradation despite their decent performance. To address these limitations, we present Dolphin (Document Image Parsing via Heterogeneous Anchor Prompting), a novel multimodal document image parsing model following an analyze-then-parse paradigm. In the first stage, Dolphin generates a sequence of layout elements in reading order. These heterogeneous elements, serving as anchors and coupled with task-specific prompts, are fed back to Dolphin for parallel content parsing in the second stage. To train Dolphin, we construct a large-scale dataset of over 30 million samples, covering multi-granularity parsing tasks. Through comprehensive evaluations on both prevalent benchmarks and self-constructed ones, Dolphin achieves state-of-the-art performance across diverse page-level and element-level settings, while ensuring superior efficiency through its lightweight architecture and parallel parsing mechanism. The code and pre-trained models are publicly available at https://github.com/ByteDance/Dolphin.
Problem

Research questions and friction points this paper is trying to address.

Addresses document image parsing complexity with intertwined elements
Overcomes integration overhead and efficiency bottlenecks in parsing
Enhances layout structure and performance via parallel content parsing
Innovation

Methods, ideas, or system contributions that make the work stand out.

Analyze-then-parse paradigm for document parsing
Heterogeneous anchor prompting for parallel parsing
Lightweight architecture with parallel parsing mechanism
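The analyze-then-parse flow described above can be sketched as a two-stage pipeline: a sequential layout-analysis stage that emits reading-order anchors, followed by a parallel stage that decodes each element's content conditioned on its anchor and a task-specific prompt. The sketch below is illustrative only; all function names, prompts, and anchor fields are assumptions, not the actual Dolphin API (see https://github.com/ByteDance/Dolphin for the real implementation).

```python
from concurrent.futures import ThreadPoolExecutor

# Task-specific prompts keyed by element type (hypothetical wording).
PROMPTS = {
    "text": "Read the text in the image.",
    "table": "Parse the table in the image.",
    "formula": "Convert the formula to LaTeX.",
}

def analyze_layout(page_image):
    """Stage 1: return layout anchors (element type + region) in reading order.

    A real model would generate this sequence autoregressively from the page
    image; here we return a fixed placeholder layout.
    """
    return [
        {"type": "text", "bbox": (0, 0, 100, 40)},
        {"type": "table", "bbox": (0, 50, 100, 90)},
    ]

def parse_element(page_image, anchor):
    """Stage 2: decode one element, conditioned on its anchor and prompt.

    A real model would decode content from the anchored region of the page
    image; here we just echo the conditioning to show the data flow.
    """
    prompt = PROMPTS[anchor["type"]]
    return f"[{anchor['type']}] prompt={prompt!r} bbox={anchor['bbox']}"

def parse_page(page_image):
    anchors = analyze_layout(page_image)   # sequential layout analysis
    with ThreadPoolExecutor() as pool:     # elements decoded in parallel
        return list(pool.map(lambda a: parse_element(page_image, a), anchors))
```

Because every element decode depends only on the page image and its own anchor, stage 2 is embarrassingly parallel, which is where the paper's reported efficiency gain over purely autoregressive page-level decoding comes from.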
👥 Authors
Hao Feng (ByteDance)
Shu Wei (ByteDance)
Xiang Fei (ByteDance)
Wei Shi (ByteDance)
Yingdong Han (ByteDance)
Lei Liao (ByteDance Inc.)
Jinghui Lu (ByteDance Inc.; School of Computer Science, University College Dublin)
Binghong Wu (ByteDance)
Qi Liu (ByteDance)
Chunhui Lin (ByteDance)
Jingqun Tang (ByteDance Inc.)
Hao Liu (ByteDance)
Can Huang (ByteDance)