Resolving Evidence Sparsity: Agentic Context Engineering for Long-Document Understanding

📅 2025-11-27

📈 Citations: 0

✨ Influential: 0

career value

181K/year

🤖 AI Summary

To address evidence sparsity (i.e., scattered clues across pages and modalities) and input redundancy that impair model reasoning in long-document understanding, this paper proposes SLEUTH, a multi-agent framework. It employs a two-stage pipeline—coarse-grained retrieval followed by fine-grained collaborative reasoning—integrating retrieval-augmented generation, multimodal clue identification, visual evidence filtering, query-aware reasoning planning, and context distillation. SLEUTH dynamically selects salient textual and visual evidence and constructs hierarchical, high-density multimodal contexts. Its key innovations include: (1) the first application of multi-agent collaboration to long-document understanding, enabling model-agnostic, adaptive reasoning strategy planning; and (2) high-fidelity multimodal context distillation. SLEUTH achieves state-of-the-art performance across multiple long-document benchmarks. Ablation studies confirm the efficacy of each component, significantly improving both accuracy and robustness of vision-language models on complex document tasks.

Technology Category

Application Category

📝 Abstract

Document understanding is a long standing practical task. Vision Language Models (VLMs) have gradually become a primary approach in this domain, demonstrating effective performance on single page tasks. However, their effectiveness diminishes when handling long documents. In such scenarios, clues are often scattered across multiple pages and modalities, and redundancy from lengthy inputs can impair the models judgment. While retrieval augmented generation mitigates this issue by filtering for question relevant content, the retrieved results still contain substantial redundancy. To address these limitations, we propose SLEUTH, a multi agent framework. Concretely, SLEUTH orchestrates a retriever and four collaborative agents in a coarse to fine process. The framework identifies key textual and visual clues within the retrieved pages, filters for salient visual evidence such as tables and charts, and analyzes the query to devise a reasoning strategy. It ultimately synthesizes a distilled, evidence dense multimodal context to generate the final prediction. SLEUTH is model agnostic and scalable. When paired with advanced VLM backbones, it consistently improves performance on multiple long document benchmarks, achieving state of the art results. Ablation studies verify each modules effectiveness and confirm the benefits of our hierarchical refinement paradigm.

Problem

Research questions and friction points this paper is trying to address.

Addresses evidence sparsity in long-document understanding

Reduces redundancy from lengthy multimodal inputs

Enhances Vision Language Models for scattered multi-page clues

Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-agent framework orchestrates retrieval and analysis

Filters salient visual evidence like tables and charts

Synthesizes distilled evidence-dense multimodal context for prediction

🔎 Similar Papers

No similar papers found.