Doc-V*: Coarse-to-Fine Interactive Visual Reasoning for Multi-Page Document VQA

📅 2026-04-15

🤖 AI Summary
This work addresses visual question answering over long multi-page documents, which requires reasoning across semantic content, layout structure, and visual elements. Existing OCR-free approaches struggle to balance capacity against precision: end-to-end models scale poorly with document length, while visual retrieval pipelines are brittle and passive. The authors propose an OCR-free agentic framework that treats the task as sequential evidence aggregation: it starts from a thumbnail overview, navigates through semantic retrieval and targeted page fetching, and collects evidence in a structured working memory for grounded reasoning. The model is trained by imitation learning from expert trajectories and then optimized with Group Relative Policy Optimization to balance answer accuracy against evidence-seeking efficiency. Across five benchmarks, the method outperforms open-source baselines and approaches proprietary models, improving out-of-domain performance by up to 47.9% over a RAG baseline. Further results attribute the gains to evidence aggregation with selective attention rather than to a larger number of input pages.
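
Group Relative Policy Optimization scores each sampled trajectory relative to the other trajectories sampled for the same question, so no separate value model is needed. The following is a minimal sketch of that group normalization, assuming one scalar reward per trajectory. The function name and the example rewards are hypothetical, and the summary does not specify the reward the authors actually use.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of trajectories for the same question.

    Each advantage is the reward minus the group mean, divided by the group
    standard deviation (eps avoids division by zero when all rewards match).
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four trajectories sampled for a single question,
# e.g. 1.0 for a correct answer and 0.0 for an incorrect one.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Trajectories that beat the group average receive a positive advantage and the rest a negative one. A group in which every trajectory earns the same reward contributes no learning signal.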

📝 Abstract
Multi-page Document Visual Question Answering requires reasoning over semantics, layouts, and visual elements in long, visually dense documents. Existing OCR-free methods face a trade-off between capacity and precision: end-to-end models scale poorly with document length, while visual retrieval-based pipelines are brittle and passive. We propose Doc-$V^*$, an OCR-free agentic framework that casts multi-page DocVQA as sequential evidence aggregation. Doc-$V^*$ begins with a thumbnail overview, then actively navigates via semantic retrieval and targeted page fetching, and aggregates evidence in a structured working memory for grounded reasoning. Trained by imitation learning from expert trajectories and further optimized with Group Relative Policy Optimization, Doc-$V^*$ balances answer accuracy with evidence-seeking efficiency. Across five benchmarks, Doc-$V^*$ outperforms open-source baselines and approaches proprietary models, improving out-of-domain performance by up to 47.9% over a RAG baseline. Further results indicate that the gains stem from effective evidence aggregation with selective attention, not from increasing the number of input pages.
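
The navigate-then-aggregate loop described in the abstract can be sketched roughly as follows. This is a toy illustration under stated assumptions, not the authors' implementation: `aggregate_evidence`, `WorkingMemory`, and the three callables are hypothetical names, and the demo uses text strings as stand-ins for page images even though the actual framework is OCR-free and works on images.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict


@dataclass
class WorkingMemory:
    """Structured store of the question and the evidence gathered so far."""
    question: str
    evidence: Dict[int, str] = field(default_factory=dict)  # page index -> note


def aggregate_evidence(
    question: str,
    num_pages: int,
    score_page: Callable[[str, int], float],        # assumed retrieval scorer
    read_page: Callable[[int], str],                # assumed page fetcher
    is_sufficient: Callable[[WorkingMemory], bool], # assumed stopping rule
    max_steps: int = 4,
) -> WorkingMemory:
    memory = WorkingMemory(question)
    for _ in range(max_steps):
        unseen = [i for i in range(num_pages) if i not in memory.evidence]
        if not unseen:
            break
        # Navigate: pick the unseen page that scores highest for the question.
        best = max(unseen, key=lambda i: score_page(question, i))
        # Fetch the page and record what was found in working memory.
        memory.evidence[best] = read_page(best)
        # Stop as soon as the gathered evidence is judged sufficient.
        if is_sufficient(memory):
            break
    return memory


# Toy demo: strings stand in for page images (an assumption for illustration).
pages = ["cover page", "revenue table: 2023 revenue was 5M", "appendix"]
question = "What was the 2023 revenue?"
query_words = set(question.lower().replace("?", "").split())

mem = aggregate_evidence(
    question,
    len(pages),
    score_page=lambda q, i: len(query_words & set(pages[i].lower().split())),
    read_page=lambda i: pages[i],
    is_sufficient=lambda m: any("revenue" in e for e in m.evidence.values()),
)
```

In this demo the loop fetches only the one page that matches the question and then stops, which mirrors the abstract's point that selective evidence gathering, rather than feeding in more pages, is what the method relies on.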
Problem

Research questions and friction points this paper is trying to address.

Multi-page Document VQA
Visual Reasoning
OCR-free
Evidence Aggregation
Long Document Understanding
Innovation

Methods, ideas, or system contributions that make the work stand out.

OCR-free
agentic reasoning
evidence aggregation
visual retrieval
multi-page DocVQA
Yuanlei Zheng
School of Software Engineering, Huazhong University of Science and Technology
Pei Fu
MiLM Plus, Xiaomi Inc.
Hang Li
MiLM Plus, Xiaomi Inc.
Ziyang Wang
School of Software Engineering, Huazhong University of Science and Technology
Yuyi Zhang
South China University of Technology
Computer Vision, Diffusion, Image generation, Handwritten Character Recognition, OCR
Wenyu Ruan
School of Software Engineering, Huazhong University of Science and Technology
Xiaojin Zhang
School of Computer Science and Technology, Huazhong University of Science and Technology
Zhongyu Wei
School of Data Science, Fudan University
Zhenbo Luo
XiaoMi
Vision Language Model, Computer Vision
Jian Luan
Toshiba, Microsoft, Xiaomi
LLM, VLM, TTS, Singing Synthesis
Wei Chen
School of Software Engineering, Huazhong University of Science and Technology
Xiang Bai
Huazhong University of Science and Technology (HUST)
Computer Vision, OCR