MM-Doc-R1: Training Agents for Long Document Visual Question Answering through Multi-turn Reinforcement Learning

📅 2026-04-15

🤖 AI Summary

This work addresses a limitation of conventional retrieval-augmented generation (RAG) systems in long-document, multi-hop visual question answering: retrieval happens in a single pass, which hurts performance on complex queries. The authors propose a vision-aware agent framework that retrieves and synthesizes information iteratively over multiple turns. The core contribution is the Similarity-based Policy Optimization (SPO) algorithm, which estimates the baseline as a similarity-weighted average of rewards across trajectories and thereby reduces the baseline bias that arises in multi-turn reinforcement learning. Built on the Qwen3 series of models, the approach outperforms previous baselines on the MMLongBench-Doc benchmark by 10.4%. SPO also outperforms GRPO by 5.0% with Qwen3-8B and by 6.1% with Qwen3-4B.
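To make the contrast between single-pass and multi-turn retrieval concrete, here is a minimal, text-only sketch of an iterative retrieve-then-decide loop. It is not the authors' implementation: the names `retrieve`, `run_agent`, and `toy_policy` are hypothetical, the keyword-overlap retriever stands in for a real (visual) retriever, and the hand-written policy stands in for a trained vision-language model.

```python
def retrieve(query, pages, seen, k=1):
    """Toy keyword-overlap retriever (assumption): return up to k unseen page ids."""
    q_tokens = set(query.lower().split())
    scored = sorted(
        (
            (len(q_tokens & set(text.lower().split())), pid)
            for pid, text in pages.items()
            if pid not in seen
        ),
        reverse=True,
    )
    return [pid for score, pid in scored[:k] if score > 0]


def run_agent(question, pages, policy, max_turns=4):
    """Alternate between retrieval and a policy decision until the policy answers."""
    evidence = []  # (page_id, text) pairs gathered so far
    seen = set()
    query = question
    for _ in range(max_turns):
        for pid in retrieve(query, pages, seen):
            seen.add(pid)
            evidence.append((pid, pages[pid]))
        action, payload = policy(question, evidence)
        if action == "answer":
            return payload, evidence
        query = payload  # the policy asked for another search
    return None, evidence  # turn budget exhausted without an answer


def toy_policy(question, evidence):
    """Hand-written stand-in for a trained model (assumption)."""
    texts = " ".join(text for _, text in evidence)
    if "berlin" in texts:
        return "answer", "berlin"
    if "alice" in texts:
        return "search", "alice office"
    return "search", question


pages = {
    1: "the report was written by alice",
    2: "alice works in the berlin office",
    3: "unrelated text about budgets",
}
answer, evidence = run_agent("where does the author of the report work", pages, toy_policy)
```

In this toy example the first retrieval only surfaces the page that names the author; the second query, issued after reading that page, surfaces the page with the answer. A single-pass retriever with the same budget per call would stop after the first page.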

📝 Abstract

Conventional Retrieval-Augmented Generation (RAG) systems often struggle with complex multi-hop queries over long documents because they retrieve in a single pass. We introduce MM-Doc-R1, a novel framework that employs an agentic, vision-aware workflow to address long-document visual question answering through iterative information discovery and synthesis. To incentivize the information-seeking capabilities of our agents, we propose Similarity-based Policy Optimization (SPO), which addresses baseline estimation bias in existing multi-turn reinforcement learning (RL) algorithms such as GRPO. Our core insight is that in multi-turn RL, the more semantically similar two trajectories are, the more accurate their shared baseline estimation becomes. Leveraging this, SPO calculates a more precise baseline by similarity-weighted averaging of rewards across multiple trajectories, unlike GRPO, which inappropriately applies the initial state's baseline to all intermediate states. This provides a more stable and accurate learning signal for our agents, leading to training performance that surpasses GRPO. Our experiments on the MMLongBench-Doc benchmark show that MM-Doc-R1 outperforms previous baselines by 10.4%. Furthermore, SPO outperforms GRPO by 5.0% with Qwen3-8B and by 6.1% with Qwen3-4B. These results highlight the effectiveness of our integrated framework and novel training algorithm in advancing the state of the art for complex, long-document visual question answering.
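The difference between a shared group baseline and a similarity-weighted baseline can be illustrated with a small sketch. The abstract does not give SPO's exact formula, so everything here is an assumption for illustration: trajectories are represented by hypothetical embedding vectors, similarity is cosine similarity, weights are an exponential of similarity with a hypothetical `temperature`, and the baseline is computed once per trajectory rather than per intermediate state.

```python
import math


def group_mean_advantages(rewards):
    """GRPO-style baseline: one shared mean reward for the whole group."""
    baseline = sum(rewards) / len(rewards)
    return [r - baseline for r in rewards]


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def similarity_weighted_advantages(rewards, embeddings, temperature=0.1):
    """Per-trajectory baseline as a similarity-weighted average of rewards.

    Assumption: weights are exp(cosine / temperature); the paper may use a
    different similarity measure or weighting scheme.
    """
    advantages = []
    for i, e_i in enumerate(embeddings):
        weights = [math.exp(cosine(e_i, e_j) / temperature) for e_j in embeddings]
        baseline = sum(w * r for w, r in zip(weights, rewards)) / sum(weights)
        advantages.append(rewards[i] - baseline)
    return advantages


# Two clusters of trajectories: the first pair succeeds, the second pair fails.
rewards = [1.0, 1.0, 0.0, 0.0]
embeddings = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]

shared = group_mean_advantages(rewards)
weighted = similarity_weighted_advantages(rewards, embeddings)
```

With the shared mean, every successful trajectory receives an advantage of 0.5 even though all trajectories similar to it also succeed. With the similarity-weighted baseline, each trajectory is compared mainly against its near neighbours, so its advantage shrinks toward zero when those neighbours obtain the same reward. When all embeddings are identical, the two estimators coincide.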

Problem

Research questions and friction points this paper is trying to address.

long document visual question answering

multi-hop queries

retrieval-augmented generation

multi-turn reinforcement learning

information seeking

Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-turn Reinforcement Learning

Similarity-based Policy Optimization

Long Document Visual QA