🤖 AI Summary
This work addresses a limitation of existing vision-centric Retrieval-Augmented Generation (RAG) systems: they rely on generic retrieval signals and therefore miss the fine-grained visual semantics that complex reasoning requires. The authors propose UniDoc-RL, a unified reinforcement learning framework that models visual information acquisition as a sequential decision-making process with a hierarchical action space, in which a large vision-language model (LVLM) agent jointly performs retrieval, reranking, active perception, and reasoning. The approach refines evidence progressively, from document retrieval to image selection to region cropping, and trains end to end with a dense multi-reward scheme that gives task-aware supervision for each action. Training builds on Group Relative Policy Optimization (GRPO), which requires no separate value network, and uses a curated dataset of high-quality reasoning trajectories with fine-grained action annotations. Across three benchmarks, the method consistently surpasses state-of-the-art baselines, with gains of up to 17.7% over prior RL-based methods.
📝 Abstract
Retrieval-Augmented Generation (RAG) extends Large Vision-Language Models (LVLMs) with external visual knowledge. However, existing visual RAG systems typically rely on generic retrieval signals that overlook the fine-grained visual semantics essential for complex reasoning. To address this limitation, we propose UniDoc-RL, a unified reinforcement learning framework in which an LVLM agent jointly performs retrieval, reranking, active visual perception, and reasoning. UniDoc-RL formulates visual information acquisition as a sequential decision-making problem with a hierarchical action space. Specifically, it progressively refines visual evidence from coarse-grained document retrieval to fine-grained image selection and active region cropping, allowing the model to suppress irrelevant content and attend to information-dense regions. For effective end-to-end training, we introduce a dense multi-reward scheme that provides task-aware supervision for each action. Based on Group Relative Policy Optimization (GRPO), UniDoc-RL aligns agent behavior with multiple objectives without relying on a separate value network. To support this training paradigm, we curate a comprehensive dataset of high-quality reasoning trajectories with fine-grained action annotations. Experiments on three benchmarks demonstrate that UniDoc-RL consistently surpasses state-of-the-art baselines, yielding up to 17.7% gains over prior RL-based methods.
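To illustrate why GRPO needs no separate value network, the sketch below computes group-relative advantages: several rollouts are sampled for the same query, and each rollout's reward is normalized against the mean and standard deviation of its own group. This is a minimal sketch and not the authors' implementation. The reward component names (`retrieval`, `selection`, `crop`, `answer`), the equal weights, the weighted-sum aggregation, and the use of the population standard deviation are all assumptions made for illustration; the abstract does not specify how the per-action rewards are defined or combined.

```python
from statistics import mean, pstdev


def combine_rewards(components, weights):
    """Weighted sum of per-action reward components for one rollout.

    Assumption: the abstract only says a dense multi-reward scheme is used;
    a weighted sum is one simple way to aggregate such rewards.
    """
    return sum(weights[name] * value for name, value in components.items())


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of rollouts for the same query.

    The group mean serves as the baseline, so no learned value network
    is needed to estimate it.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an illustrative choice
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical reward weights and a hypothetical group of four rollouts.
weights = {"retrieval": 0.25, "selection": 0.25, "crop": 0.25, "answer": 0.25}
group = [
    {"retrieval": 1.0, "selection": 1.0, "crop": 1.0, "answer": 1.0},
    {"retrieval": 1.0, "selection": 0.0, "crop": 0.0, "answer": 0.0},
    {"retrieval": 0.0, "selection": 0.0, "crop": 0.0, "answer": 0.0},
    {"retrieval": 1.0, "selection": 1.0, "crop": 0.0, "answer": 0.0},
]

rewards = [combine_rewards(c, weights) for c in group]
advantages = group_relative_advantages(rewards)

for r, a in zip(rewards, advantages):
    print(f"reward={r:.2f} advantage={a:+.3f}")
```

Rollouts that score above their group's average receive positive advantages and are reinforced, while below-average rollouts are discouraged. When every rollout in a group earns the same reward, all advantages are zero and that group contributes no learning signal.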