DocSeeker: Structured Visual Reasoning with Evidence Grounding for Long Document Understanding

📅 2026-04-14

🤖 AI Summary
This work addresses the significant performance degradation of multimodal large language models (MLLMs) on ultra-long documents, which the authors attribute to a low signal-to-noise ratio and scarce supervision signals. To mitigate these challenges, they propose a structured “analyze–locate–reason” workflow trained with a two-stage framework. First, the model is fine-tuned on high-quality supervision data generated via knowledge distillation; then an Evidence-aware Group Relative Policy Optimization stage jointly optimizes evidence localization and answer accuracy, while an Evidence-Guided Resolution Allocation strategy eases the memory cost of training on multi-page documents. The authors report superior performance on both in-domain and out-of-domain tasks, generalization from short-document training to ultra-long documents, and a natural fit with visual Retrieval-Augmented Generation (RAG) systems.
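
To make the resolution-allocation idea concrete, here is a minimal Python sketch. The summary does not describe the actual mechanism, so everything here is an assumption for illustration: the function name `allocate_pixel_budgets`, the 0-based page indexing, the two fixed pixel budgets, and the rule that known evidence pages receive the high budget while all other pages receive the low one.

```python
def allocate_pixel_budgets(num_pages, evidence_pages,
                           high_res=1024 * 1024, low_res=256 * 256):
    """Return one pixel budget per page (hypothetical scheme, not the paper's).

    Pages listed in `evidence_pages` (0-based indices) get `high_res`;
    every other page gets `low_res`, which keeps the total visual token
    count small when most pages are irrelevant.
    """
    evidence = set(evidence_pages)
    return [high_res if page in evidence else low_res
            for page in range(num_pages)]


budgets = allocate_pixel_budgets(num_pages=4, evidence_pages=[1])
total_pixels = sum(budgets)
```

Under these assumed budgets, a four-page document with one evidence page costs far fewer pixels than rendering every page at high resolution, which is the kind of memory saving the strategy is meant to deliver.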

📝 Abstract
Existing Multimodal Large Language Models (MLLMs) suffer from significant performance degradation on the long document understanding task as document length increases. This stems from two fundamental challenges: 1) a low Signal-to-Noise Ratio (SNR), with crucial evidence buried in irrelevant pages; and 2) supervision scarcity, as datasets offering only final short answers provide a weak learning signal. In this paper, we address these challenges by proposing a paradigm that requires the model to execute a structured "Analysis, Localization, and Reasoning" workflow. To instill this capability, we design a two-stage training framework: we first perform Supervised Fine-Tuning on high-quality data generated via an efficient knowledge distillation strategy. Subsequently, we employ an Evidence-aware Group Relative Policy Optimization, which jointly optimizes evidence localization and answer accuracy. Additionally, we introduce an Evidence-Guided Resolution Allocation strategy to mitigate the memory constraints of training on multi-page documents. Extensive experiments demonstrate that DocSeeker achieves superior performance on both in-domain and out-of-domain tasks. We show that it robustly generalizes from short-page training to ultra-long documents and is naturally synergistic with visual Retrieval-Augmented Generation systems, serving as a solid foundation for their implementation.
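
As an illustration of what jointly optimizing evidence localization and answer accuracy could look like, the sketch below combines a page-level evidence F1 score with an exact-match answer score, then normalizes rewards within a group of sampled responses as in standard Group Relative Policy Optimization. The abstract does not give the reward definition, so the function names, the F1 and exact-match choices, and the `evidence_weight` parameter are all assumptions, not the authors' formulation.

```python
from statistics import mean, pstdev


def evidence_f1(predicted_pages, gold_pages):
    """F1 overlap between predicted and gold evidence page indices."""
    pred, gold = set(predicted_pages), set(gold_pages)
    overlap = len(pred & gold)
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(gold)
    return 2 * precision * recall / (precision + recall)


def joint_reward(predicted_pages, gold_pages, answer, gold_answer,
                 evidence_weight=0.5):
    """Weighted sum of evidence F1 and exact-match answer score.

    `evidence_weight` is a hypothetical hyperparameter for this sketch.
    """
    answer_score = float(answer.strip().lower() == gold_answer.strip().lower())
    return (evidence_weight * evidence_f1(predicted_pages, gold_pages)
            + (1.0 - evidence_weight) * answer_score)


def group_relative_advantages(rewards, eps=1e-6):
    """Standardize rewards within one group of sampled responses."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Two sampled responses to the same question: one grounded and correct,
# one pointing at the wrong page with a wrong answer.
rewards = [
    joint_reward([2], [2], "Paris", "paris"),
    joint_reward([5], [2], "Lyon", "paris"),
]
advantages = group_relative_advantages(rewards)
```

In this toy group the grounded, correct response receives a positive advantage and the other a negative one, so a policy update would favor responses that both cite the right page and answer correctly.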

Problem

Research questions and friction points this paper is trying to address.

long document understanding
Signal-to-Noise Ratio
supervision scarcity
evidence grounding
Multimodal Large Language Models

Innovation

Methods, ideas, or system contributions that make the work stand out.

structured visual reasoning
evidence grounding
long document understanding
multimodal large language models
retrieval-augmented generation