Training-Free Acceleration for Document Parsing Vision-Language Model with Hierarchical Speculative Decoding

📅 2026-02-13
📈 Citations: 0
Influential: 0

📝 Abstract
Document parsing is a fundamental task in multimodal understanding, supporting a wide range of downstream applications such as information extraction and intelligent document analysis. Benefiting from strong semantic modeling and robust generalization, VLM-based end-to-end approaches have emerged as the mainstream paradigm in recent years. However, these models often suffer from substantial inference latency, as they must auto-regressively generate long token sequences when processing long-form documents. In this work, motivated by the extremely long outputs and complex layout structures commonly found in document parsing, we propose a training-free and highly efficient acceleration method. Inspired by speculative decoding, we employ a lightweight document parsing pipeline as a draft model to predict batches of future tokens, while the more accurate VLM verifies these draft predictions in parallel. Moreover, we further exploit the layout-structured nature of documents by partitioning each page into independent regions, enabling parallel decoding of each region using the same draft-verify strategy. The final predictions are then assembled according to the natural reading order. Experimental results demonstrate the effectiveness of our approach: on the general-purpose OmniDocBench, our method provides a 2.42x lossless acceleration for the dots.ocr model, and achieves up to 4.89x acceleration on long-document parsing tasks. We will release our code to facilitate reproducibility and future research.
Problem

Research questions and friction points this paper is trying to address.

document parsing
vision-language model
inference latency
long-form documents
auto-regressive generation
Innovation

Methods, ideas, or system contributions that make the work stand out.

speculative decoding
document parsing
vision-language model
training-free acceleration
layout-aware parallel decoding
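The layout-aware parallel decoding idea can be sketched as follows, under stated assumptions: `decode_region` is a placeholder for the per-region draft-verify decoder, regions carry a `bbox` of (x, y, w, h), and reading order is approximated as top-to-bottom then left-to-right. All names are illustrative, not the authors' API.

```python
from concurrent.futures import ThreadPoolExecutor

def reading_order(region):
    # Sort key: top-to-bottom, then left-to-right, by the box's top-left corner.
    x, y, w, h = region["bbox"]
    return (y, x)

def parse_page(regions, decode_region, workers=4):
    """Decode independent layout regions in parallel, then assemble
    their outputs in natural reading order."""
    ordered = sorted(regions, key=reading_order)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map preserves input order, so outputs stay in reading order.
        texts = list(pool.map(decode_region, ordered))
    return "\n".join(texts)
```

Because regions are decoded independently, the page-level latency is bounded by the slowest region rather than the sum of all regions, which is where the long-document speedups come from.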
👥 Authors

Wenhui Liao (South China University of Technology, Guangzhou, China)
Hongliang Li (South China University of Technology, Guangzhou, China)
Pengyu Xie (Shanghai Artificial Intelligence Laboratory, Shanghai, China)
Xinyu Cai (Shanghai Artificial Intelligence Laboratory): Artificial Intelligence, Autonomous Driving
Yufan Shen (Zhejiang University): MLLM, GUI Agent
Yi Xin (California Institute of Technology): Industrial Organization, Econometrics
Qi Qin (Shanghai Artificial Intelligence Laboratory, Shanghai, China)
Shenglong Ye (Shanghai Artificial Intelligence Laboratory, Shanghai, China)
Tianbin Li (Shanghai Artificial Intelligence Laboratory): Machine Learning, Computer Vision, General Intelligence
Ming Hu (Monash University | Shanghai AI Laboratory)
Junjun He (Shanghai Jiao Tong University)
Yihao Liu (Shanghai Artificial Intelligence Laboratory): computer vision, multimodal generation, image restoration
Wenhai Wang (Shanghai Artificial Intelligence Laboratory, Shanghai, China)
Min Dou (Shanghai AI Laboratory): Autonomous Driving, MLLM, Embodied AI
Bin Fu (Shenzhen Institute of Advanced Technology (SIAT), Chinese Academy of Sciences): computer vision, scene understanding, scene OCR, font generation
Botian Shi (Shanghai Artificial Intelligence Laboratory): VLMs, Document Understanding, Autonomous Driving
Yu Qiao (Professor of Shanghai AI Laboratory; Shenzhen Institutes of Advanced Technology, CAS): Computer Vision, Pattern Recognition, Large Multimodal Model, Large Language Model
Lianwen Jin (Professor of Electronic and Information Engineering, South China University of Technology): Optical Character Recognition (OCR), Computer Vision, Document AI, Multimodal LLMs