🤖 AI Summary
Existing approaches to multi-image visual grounding are constrained by single-target assumptions and limited task formulations, hindering their applicability to generalized localization scenarios. This work formally defines the generalized multi-image visual grounding task for the first time and introduces GeM-VG, a multimodal large language model tailored for this setting. To support comprehensive evaluation, we construct MG-Data-240K, a large-scale dataset encompassing multiple targets and complex cross-image relationships. We further propose a hybrid reinforcement fine-tuning strategy that integrates chain-of-thought reasoning with direct answering, augmented by a rule-based reward mechanism to enhance cross-image perception and reasoning. GeM-VG achieves state-of-the-art performance, outperforming prior best models by 2.0% on MIG-Bench and 9.7% on MC-Bench, while also improving single-image grounding by 9.1% on ODINW, all without compromising its general multi-image understanding capabilities.
📝 Abstract
Multimodal Large Language Models (MLLMs) have demonstrated impressive progress in single-image grounding and general multi-image understanding. Recently, some methods have begun to address multi-image grounding. However, they are constrained to single-target localization and a limited range of practical tasks, owing to the lack of unified modeling for generalized grounding. Therefore, we propose GeM-VG, an MLLM capable of Generalized Multi-image Visual Grounding. To support this, we systematically categorize and organize existing multi-image grounding tasks according to their reliance on cross-image cues and reasoning, and introduce the MG-Data-240K dataset, which addresses the limitations of existing datasets in target quantity and image relations. To robustly handle diverse multi-image grounding tasks, we further propose a hybrid reinforcement fine-tuning strategy that integrates chain-of-thought (CoT) reasoning and direct answering, exploiting their complementary strengths. This strategy adopts an R1-like algorithm guided by a carefully designed rule-based reward, effectively enhancing the model's overall perception and reasoning capabilities. Extensive experiments demonstrate the superior generalized grounding capabilities of our model. For multi-image grounding, it outperforms the previous leading MLLMs by 2.0% and 9.7% on MIG-Bench and MC-Bench, respectively. In single-image grounding, it achieves a 9.1% improvement over the base model on ODINW. Furthermore, our model retains strong capabilities in general multi-image understanding.
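To make the "rule-based reward" concrete, here is a minimal sketch of how such a reward for R1-like reinforcement fine-tuning on multi-target grounding is often built: a format term that checks the response parses into boxes, plus an accuracy term that matches predicted boxes to ground truth by IoU. The `<box>image_idx,x1,y1,x2,y2</box>` output format, the 0.5/0.5 weighting, and the greedy matching are illustrative assumptions, not the paper's actual specification.

```python
import re


def box_iou(a, b):
    """IoU between two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def grounding_reward(response, gt_boxes, iou_thresh=0.5):
    """Rule-based reward = format term + F1-style accuracy term.

    Hypothetical output format: one <box>image_idx,x1,y1,x2,y2</box>
    tag per target. gt_boxes is a list of (image_idx, (x1, y1, x2, y2)).
    """
    pred = []
    for m in re.findall(r"<box>([^<]*)</box>", response):
        parts = m.split(",")
        if len(parts) != 5:
            return 0.0  # malformed box: no reward at all
        try:
            idx = int(parts[0])
            coords = tuple(float(p) for p in parts[1:])
        except ValueError:
            return 0.0
        pred.append((idx, coords))
    format_reward = 0.5 if pred else 0.0

    # Greedily match each prediction to an unmatched ground-truth box
    # on the same image, counting a hit when IoU clears the threshold.
    matched, hits = set(), 0
    for idx, box in pred:
        for j, (g_idx, g_box) in enumerate(gt_boxes):
            if j in matched or g_idx != idx:
                continue
            if box_iou(box, g_box) >= iou_thresh:
                matched.add(j)
                hits += 1
                break

    # F1 over multiple targets penalizes both missed and spurious boxes.
    prec = hits / len(pred) if pred else 0.0
    rec = hits / len(gt_boxes) if gt_boxes else 0.0
    acc = 2 * prec * rec / (prec + rec) if prec + rec > 0 else 0.0
    return format_reward + 0.5 * acc
```

An F1-style accuracy term (rather than a single-box IoU) is what makes the reward usable when the number of targets varies, since it punishes both under- and over-prediction across images.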