ALDEN: Reinforcement Learning for Active Navigation and Evidence Gathering in Long Documents

📅 2025-10-29
🤖 AI Summary
Vision-language models (VLMs) struggle with long, multi-page documents due to difficulties in cross-page information integration, passive navigation, and poor generalization. To address these challenges, we propose an active document understanding framework grounded in reinforcement learning. Our method introduces a learnable “fetch” action for page access, coupled with visual-semantic anchoring to enable precise cross-page evidence retrieval. We design a hierarchical reward mechanism—combining structural alignment and answer correctness—and enforce training stability via dual-path KL divergence constraints. Page indices are leveraged for efficient, structure-aware navigation. Crucially, our approach eliminates reliance on fixed reasoning templates, endowing VLMs with autonomous navigation and dynamic evidence aggregation capabilities. Evaluated on five long-document benchmarks, it achieves state-of-the-art performance, improving navigation efficiency (reducing average page accesses by 37%) and reasoning accuracy (+4.2–9.8%).
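The fetch/search navigation loop described above can be sketched as a toy environment. Everything here is illustrative: the real agent is a fine-tuned VLM acting on page images, not strings, and the class and function names below are assumptions, not the paper's interface.

```python
# Toy sketch of ALDEN-style active navigation (illustrative assumption:
# the real policy is a VLM and pages are images, not text strings).
from dataclasses import dataclass

@dataclass
class Document:
    pages: list  # page contents, addressable by index

    def fetch(self, index):
        """The learnable 'fetch' action: direct page access by index."""
        return self.pages[index]

    def search(self, query):
        """The classic 'search' action: indices of pages matching a query."""
        return [i for i, p in enumerate(self.pages) if query in p]

def navigate(doc, query, max_turns=5):
    """Multi-turn loop: coarse search first, then targeted fetches."""
    evidence = []
    candidates = doc.search(query)          # turn 1: retrieve candidates
    for idx in candidates[:max_turns - 1]:  # later turns: fetch evidence pages
        evidence.append((idx, doc.fetch(idx)))
    return evidence

doc = Document(pages=["intro", "methods: reward design", "results: reward curves"])
evidence = navigate(doc, "reward")  # fetches the two matching pages
```

The point of the fetch action is that page indices give the agent random access to document structure, rather than forcing it through sequential reading or retrieval alone.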

📝 Abstract
Vision-language models (VLMs) excel at interpreting text-rich images but struggle with long, visually complex documents that demand analysis and integration of information spread across multiple pages. Existing approaches typically rely on fixed reasoning templates or rigid pipelines, which force VLMs into a passive role and hinder both efficiency and generalization. We present Active Long-DocumEnt Navigation (ALDEN), a multi-turn reinforcement learning framework that fine-tunes VLMs as interactive agents capable of actively navigating long, visually rich documents. ALDEN introduces a novel fetch action that directly accesses the page by index, complementing the classic search action and better exploiting document structure. For dense process supervision and efficient training, we propose a rule-based cross-level reward that provides both turn- and token-level signals. To address the empirically observed training instability caused by numerous visual tokens from long documents, we further propose a visual-semantic anchoring mechanism that applies a dual-path KL-divergence constraint to stabilize visual and textual representations separately during training. Trained on a corpus constructed from three open-source datasets, ALDEN achieves state-of-the-art performance on five long-document benchmarks. Overall, ALDEN marks a step beyond passive document reading toward agents that autonomously navigate and reason across long, visually rich documents, offering a robust path to more accurate and efficient long-document understanding.
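The rule-based cross-level reward combines an outcome-level signal (answer correctness) with denser per-turn signals (navigation quality). A minimal sketch of how such signals might be mixed follows; the specific rule (gold-page hit rate) and the weights are assumptions for illustration, not the paper's actual reward.

```python
def cross_level_reward(answer_correct, pages_fetched, gold_pages,
                       w_turn=0.5, w_outcome=1.0):
    """Toy rule-based reward mixing outcome- and turn-level signals.
    The rules and weights are illustrative assumptions."""
    # Outcome-level signal: was the final answer correct?
    outcome = 1.0 if answer_correct else 0.0
    # Turn-level signal: fraction of fetched pages containing gold evidence,
    # rewarding efficient navigation rather than indiscriminate page access.
    hits = sum(1 for p in pages_fetched if p in gold_pages)
    turn = hits / len(pages_fetched) if pages_fetched else 0.0
    return w_outcome * outcome + w_turn * turn
```

Dense turn-level terms like this give the policy gradient a signal at every navigation step instead of only at the final answer, which is what makes the supervision "process-level".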
Problem

Research questions and friction points this paper is trying to address.

Active navigation in long, visually complex documents
Overcoming rigid, template-driven pipelines that keep VLMs in a passive role
Stabilizing RL training destabilized by the many visual tokens in long documents
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-turn reinforcement learning framework for active document navigation
Fetch action that accesses pages directly by index to exploit document structure
Visual-semantic anchoring via dual-path KL-divergence constraints for training stability
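The dual-path KL constraint penalizes policy drift from a reference model separately over visual-token and text-token positions. A minimal sketch under simplifying assumptions (explicit probability lists rather than model logits; the coefficient names are hypothetical):

```python
import math

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions given as probability lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def dual_path_kl(policy, reference, visual_idx, beta_v=0.1, beta_t=0.05):
    """Separate KL penalties for visual and textual token positions.

    `policy` and `reference` are per-position next-token distributions;
    `visual_idx` marks which positions are visual tokens. Splitting the
    constraint lets the two modalities be anchored with different strengths
    (coefficients here are illustrative assumptions).
    """
    text_idx = [i for i in range(len(policy)) if i not in visual_idx]
    kl_v = sum(kl_divergence(policy[i], reference[i]) for i in visual_idx)
    kl_t = sum(kl_divergence(policy[i], reference[i]) for i in text_idx)
    return beta_v * kl_v + beta_t * kl_t
```

A single pooled KL would let drift in the numerous visual tokens dominate the penalty; splitting the paths keeps the textual representation independently anchored.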