ROVER: Routing Object-Centric Visual Evidence for Grounded Multi-Image Reasoning

📅 2026-05-27

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses two limitations of existing approaches to grounded visual reasoning. Region-based visual evidence injection, which inserts cropped image patches or RoI-specific features into the reasoning context, can weaken holistic scene understanding and inter-object relations, and its decoding cost scales with the number and size of RoIs. Adaptive visual feature selection, in turn, often relies on fine-grained supervision or complex heuristics. To overcome these issues, the authors propose ROVER, a lightweight, learnable plug-in module that injects a step-specific token triplet after each object grounding prediction. The triplet aggregates the ongoing reasoning context, distills intra-image cues into a visual working space via object-centric differential attention, and routes history-aware evidence across objects and images for subsequent reasoning, without requiring additional supervision. Integrated into Qwen2.5-VL-7B and trained with an interleaved SFT-to-GRPO pipeline, ROVER achieves the best performance on MM-GCoT (+4.8% answer accuracy, +14.6% grounding accuracy) and VideoEspresso (+8.6% answer accuracy). The VideoEspresso-trained model also outperforms the base model by +4.7% on average across diverse benchmarks, indicating strong transferability.

📝 Abstract

Multimodal Large Language Models (MLLMs) have increasingly localized and interleaved visual evidence for deliberative reasoning. Grounding-based approaches typically focus on regions of interest (RoIs) by injecting cropped image patches or RoI-specific features into the reasoning context. However, such designs can weaken holistic scene understanding and inter-object relations, while incurring decoding costs that scale with the number and size of RoIs. Alternatively, adaptive visual feature selection often requires fine-grained supervision or complex heuristics. To address these limitations, we propose ROVER (Routing Object-centric Visual Evidence for grounded multi-image Reasoning), a lightweight, learnable plugin for efficient global visual evidence routing. Upon each object grounding prediction, ROVER injects a step-specific token triplet to synergistically: (i) aggregate the ongoing reasoning context, (ii) distill intra-image cues into a visual working space via object-centric differential attention, and (iii) route and integrate history-aware evidence across objects and images within this space for subsequent reasoning. We integrate ROVER into Qwen2.5-VL-7B and develop an interleaved SFT-to-GRPO training pipeline. Strictly adhering to the original datasets and evaluation protocols, our method achieves the best performance on MM-GCoT (+4.8% answer accuracy, +14.6% grounding accuracy) and VideoEspresso (+8.6% answer accuracy). The VideoEspresso-trained model demonstrates strong transferability, outperforming the base model by +4.7% on average across diverse benchmarks.
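The abstract names object-centric differential attention without giving its formulation. As a rough illustration only, the sketch below shows the generic differential-attention idea (subtracting a second, scaled softmax attention map from the first so that weight shared by both maps is suppressed). It is not ROVER's actual mechanism: the function names, the single-query setup, and the `lam` scaling factor are all assumptions made for this example.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_weights(query, keys):
    """Scaled dot-product attention weights of one query over a list of keys."""
    scale = 1.0 / math.sqrt(len(query))
    scores = [scale * sum(q * k for q, k in zip(query, key)) for key in keys]
    return softmax(scores)


def differential_attention(q1, q2, keys1, keys2, values, lam=0.5):
    """Generic differential attention (hypothetical sketch, not the paper's code).

    Two attention maps are computed over the same positions; the second is
    scaled by `lam` and subtracted from the first. The resulting weights sum
    to 1 - lam and may be negative for individual positions.
    """
    a1 = attention_weights(q1, keys1)
    a2 = attention_weights(q2, keys2)
    diff = [w1 - lam * w2 for w1, w2 in zip(a1, a2)]
    dim = len(values[0])
    out = [sum(w * v[d] for w, v in zip(diff, values)) for d in range(dim)]
    return out, diff


# Toy usage: three image patches; the first query strongly matches patch 0,
# while the second query is uninformative (uniform attention).
keys = [[4.0, 0.0], [0.0, 0.0], [0.0, 0.0]]
values = [[1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
pooled, weights = differential_attention(
    q1=[1.0, 0.0], q2=[0.0, 0.0], keys1=keys, keys2=keys, values=values
)
```

In this toy case the uniform second map acts as a baseline: subtracting it leaves a positive weight on the matching patch and negative weights on the others. How ROVER conditions the two maps on the grounded object is not specified in the abstract.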

Problem

Research questions and friction points this paper is trying to address.

grounded reasoning

object-centric representation

visual evidence routing

multimodal large language models

inter-object relations

Innovation

Methods, ideas, or system contributions that make the work stand out.

object-centric reasoning

visual evidence routing

multimodal large language models