Attention Grounded Enhancement for Visual Document Retrieval

📅 2025-11-17
📈 Citations: 0
Influential: 0
🤖 AI Summary
Visual document retrieval suffers from coarse-grained global relevance labels, causing models to rely on superficial cues and struggle to capture implicit semantic associations—particularly hindering performance on non-extractive queries. To address this, we propose AGREE, the first framework to leverage cross-modal attention maps from multimodal large language models (MLLMs) as proxy local supervision signals, jointly optimizing global relevance prediction and fine-grained region-level alignment. AGREE integrates screenshot-based document encoding with a late-interaction architecture, enabling end-to-end attention-guided training. This enhances deep semantic alignment between queries and salient document regions. On the ViDoRe V2 benchmark, AGREE significantly outperforms global-supervision-only baselines, achieving higher retrieval accuracy and improved interpretability. By grounding alignment in MLLM-derived attention priors, AGREE advances visual document retrieval from shallow pattern matching toward principled, semantically grounded alignment.
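The summary mentions that AGREE pairs screenshot-based document encoding with a late-interaction architecture. The paper's exact scoring function isn't reproduced here, but late interaction is commonly implemented as ColBERT-style MaxSim: each query token embedding is matched against its best document-patch embedding, and the per-token maxima are summed. A minimal sketch, assuming cosine-normalized token/patch embeddings (function name and shapes are illustrative, not from the paper):

```python
import numpy as np

def maxsim_score(query_emb: np.ndarray, doc_emb: np.ndarray) -> float:
    """Late-interaction relevance: every query token attends to its
    best-matching document patch; the per-token maxima are summed."""
    # Normalize so dot products are cosine similarities.
    q = query_emb / np.linalg.norm(query_emb, axis=-1, keepdims=True)
    d = doc_emb / np.linalg.norm(doc_emb, axis=-1, keepdims=True)
    sim = q @ d.T                        # (num_query_tokens, num_doc_patches)
    return float(sim.max(axis=1).sum())  # max over patches, sum over tokens

# Toy example: 3 query tokens vs. 5 document patches, embedding dim 8.
rng = np.random.default_rng(0)
score = maxsim_score(rng.normal(size=(3, 8)), rng.normal(size=(5, 8)))
```

The per-token argmax over patches also yields a natural place to attach region-level supervision, since it exposes which patch each query token matched.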

📝 Abstract
Visual document retrieval requires understanding heterogeneous and multi-modal content to satisfy information needs. Recent advances use screenshot-based document encoding with fine-grained late interaction, significantly improving retrieval performance. However, retrievers are still trained with coarse global relevance labels, without revealing which regions support the match. As a result, retrievers tend to rely on surface-level cues and struggle to capture implicit semantic connections, hindering their ability to handle non-extractive queries. To alleviate this problem, we propose an Attention-Grounded REtriever Enhancement (AGREE) framework. AGREE leverages cross-modal attention from multimodal large language models as proxy local supervision to guide the identification of relevant document regions. During training, AGREE combines these local signals with global relevance signals to jointly optimize the retriever, enabling it to learn not only whether documents match, but also which content drives relevance. Experiments on the challenging ViDoRe V2 benchmark show that AGREE significantly outperforms the global-supervision-only baseline. Quantitative and qualitative analyses further demonstrate that AGREE promotes deeper alignment between query terms and document regions, moving beyond surface-level matching toward more accurate and interpretable retrieval. Our code is available at: https://anonymous.4open.science/r/AGREE-2025.
Problem

Research questions and friction points this paper is trying to address.

Visual document retrieval struggles with implicit semantic connections
Current retrievers rely on surface-level cues without regional guidance
Training lacks local supervision for identifying relevant document regions
Innovation

Methods, ideas, or system contributions that make the work stand out.

AGREE uses cross-modal attention for local supervision
AGREE combines local and global signals for training
AGREE aligns query terms with document regions
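The bullets above describe joint training on a global relevance signal plus a local attention-grounding signal. The paper's exact objective isn't given in this summary; one plausible sketch combines an InfoNCE-style contrastive loss over candidate documents with a KL term that pulls the retriever's patch distribution toward the MLLM-derived attention map (the weighting `lam` and the KL form are assumptions, not the authors' stated loss):

```python
import numpy as np

def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    z = x - x.max(axis=axis, keepdims=True)  # shift for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def joint_loss(scores: np.ndarray, pos_idx: int,
               patch_sim: np.ndarray, teacher_attn: np.ndarray,
               lam: float = 0.5) -> float:
    """Global InfoNCE over candidate documents plus a local KL term
    that aligns the retriever's patch distribution with the MLLM
    attention map over the positive document's regions."""
    # Global: cross-entropy of the positive document under a softmax
    # over the query-document relevance scores.
    global_loss = -np.log(softmax(scores)[pos_idx])
    # Local: KL(teacher || student) over document patches.
    student = softmax(patch_sim)
    local_loss = np.sum(
        teacher_attn * (np.log(teacher_attn + 1e-9) - np.log(student + 1e-9))
    )
    return float(global_loss + lam * local_loss)
```

When the retriever's patch distribution already matches the MLLM attention map, the local term vanishes and only the global contrastive objective remains, so the grounding signal acts as an auxiliary regularizer rather than replacing relevance supervision.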