🤖 AI Summary
Existing open-source vision-language models (VLMs) are predominantly constrained to single-image inputs, limiting their applicability to real-world multi-image question answering (QA). This work introduces IMAgent—the first end-to-end reinforcement learning (RL)-based visual agent explicitly designed for multi-image QA, overcoming the single-image bottleneck through autonomous tool invocation and cross-image visual reasoning. Key contributions include: (1) MIFG-QA, a manually verified dataset of 10K challenging, visually rich multi-image QA samples generated by a multi-agent synthesis pipeline; (2) a dual-tool mechanism for visual reflection and confirmation that redirects the model's attention back to image content during long reasoning chains; and (3) an action-trajectory two-level masking strategy that ensures stable tool use under pure RL training, without supervised fine-tuning. In experiments, IMAgent substantially outperforms baselines on MIFG-QA while preserving strong performance on standard single-image benchmarks.
📝 Abstract
Recent VLM-based agents aim to replicate OpenAI O3's "thinking with images" via tool use, but most open-source methods limit input to a single image, falling short on real-world multi-image QA tasks. To address this, we propose IMAgent, an open-source vision agent trained via end-to-end reinforcement learning and dedicated to complex multi-image tasks. Leveraging a multi-agent system, we generate challenging, visually rich multi-image QA pairs to fully activate the tool-use potential of the base VLM. Through manual verification, we obtain MIFG-QA, comprising 10K samples for training and evaluation. As reasoning chains grow deeper, VLMs tend to increasingly ignore their visual inputs. We therefore develop two specialized tools for visual reflection and confirmation, allowing the model to proactively reallocate its attention to image content during inference. Benefiting from our well-designed action-trajectory two-level masking strategy, IMAgent achieves stable tool-use behavior via pure RL training, without requiring costly supervised fine-tuning data. Extensive experiments demonstrate that IMAgent maintains strong performance on existing single-image benchmarks while achieving substantial improvements on our proposed multi-image dataset, and our analysis provides actionable insights for the research community. Code and data will be released soon.