🤖 AI Summary
This work addresses the significant performance degradation of multimodal large language models (MLLMs) on long documents, which the authors attribute to a low signal-to-noise ratio and scarce supervision signals. To mitigate these challenges, the proposed approach, DocSeeker, follows a structured “analysis, localization, reasoning” workflow and is trained with a two-stage framework. First, the model undergoes supervised fine-tuning on high-quality data generated via knowledge distillation; subsequently, an Evidence-aware Group Relative Policy Optimization stage jointly optimizes evidence localization and answer accuracy, while an Evidence-Guided Resolution Allocation strategy eases the memory constraints of training on multi-page documents. The approach is reported to achieve superior performance on both in-domain and out-of-domain tasks, to generalize from short-document training to ultra-long documents, and to pair naturally with visual retrieval-augmented generation (RAG) systems.
📝 Abstract
Existing Multimodal Large Language Models (MLLMs) suffer from significant performance degradation on the long document understanding task as document length increases. This stems from two fundamental challenges: 1) a low Signal-to-Noise Ratio (SNR), with crucial evidence buried in irrelevant pages; and 2) supervision scarcity, as datasets offering only final short answers provide a weak learning signal. In this paper, we address these challenges by proposing a paradigm that requires the model to execute a structured “Analysis, Localization, and Reasoning” workflow. To instill this capability, we design a two-stage training framework: we first perform Supervised Fine-Tuning on high-quality data generated via an efficient knowledge distillation strategy. Subsequently, we employ an Evidence-aware Group Relative Policy Optimization, which jointly optimizes for both evidence localization and answer accuracy. Additionally, we introduce an Evidence-Guided Resolution Allocation strategy to mitigate the memory constraints of training on multi-page documents. Extensive experiments demonstrate that DocSeeker achieves superior performance on both in-domain and out-of-domain tasks. We show that it robustly generalizes from short-page training to ultra-long documents and is naturally synergistic with visual Retrieval-Augmented Generation systems, serving as a solid foundation for their implementation.
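
To make the idea of jointly rewarding evidence localization and answer accuracy concrete, the following is a minimal sketch and not the authors' implementation. It assumes the localization reward is an F1 score over predicted versus gold evidence pages, that the two terms are mixed with a fixed weight, and that advantages are normalized within a group of sampled responses, as in standard Group Relative Policy Optimization. The function names, the F1 choice, and the `evidence_weight` parameter are all hypothetical; the abstract does not specify them.

```python
from statistics import mean, pstdev


def evidence_f1(predicted_pages, gold_pages):
    """F1 overlap between predicted and gold evidence pages (assumed reward form)."""
    pred, gold = set(predicted_pages), set(gold_pages)
    overlap = len(pred & gold)
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(gold)
    return 2 * precision * recall / (precision + recall)


def combined_reward(predicted_pages, gold_pages, answer_correct, evidence_weight=0.5):
    """Weighted sum of localization and answer rewards (the weight is an assumption)."""
    localization = evidence_f1(predicted_pages, gold_pages)
    return evidence_weight * localization + (1.0 - evidence_weight) * float(answer_correct)


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of sampled responses to the same question."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical group of three sampled responses for one question
# whose gold evidence is on pages 2 and 3.
group = [
    combined_reward([2, 3], [2, 3], answer_correct=True),
    combined_reward([1, 2], [2, 3], answer_correct=True),
    combined_reward([7], [2, 3], answer_correct=False),
]
advantages = group_relative_advantages(group)
```

Under these assumptions, a response that cites the right pages and answers correctly receives a higher advantage than one that answers correctly from partially wrong evidence, which is the kind of denser learning signal the abstract argues final-answer supervision alone lacks.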