GeM-VG: Towards Generalized Multi-image Visual Grounding with Multimodal Large Language Models

📅 2026-01-08
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing approaches to multi-image visual grounding are constrained by single-target assumptions and limited task formulations, hindering their applicability to generalized localization scenarios. This work formally defines the generalized multi-image visual grounding task for the first time and introduces GeM-VG, a multimodal large language model tailored to this setting. To support this task, the authors construct MG-Data-240K, a large-scale dataset covering multiple targets and complex cross-image relationships. They further propose a hybrid reinforcement fine-tuning strategy that integrates chain-of-thought reasoning with direct answering, augmented by a rule-based reward mechanism that enhances cross-image perception and reasoning. GeM-VG achieves state-of-the-art performance, outperforming the prior best models by 2.0% on MIG-Bench and 9.7% on MC-Bench, while also improving single-image grounding on ODINW by 9.1% over its base model, all without compromising general multi-image understanding.

📝 Abstract
Multimodal Large Language Models (MLLMs) have demonstrated impressive progress in single-image grounding and general multi-image understanding. Recently, some methods have begun to address multi-image grounding. However, they are constrained to single-target localization and a limited range of practical tasks, owing to the lack of unified modeling for generalized grounding. Therefore, we propose GeM-VG, an MLLM capable of Generalized Multi-image Visual Grounding. To support this, we systematically categorize and organize existing multi-image grounding tasks according to their reliance on cross-image cues and reasoning, and introduce the MG-Data-240K dataset, which addresses the limitations of existing datasets regarding target quantity and cross-image relations. To robustly handle diverse multi-image grounding tasks, we further propose a hybrid reinforcement fine-tuning strategy that integrates chain-of-thought (CoT) reasoning with direct answering, exploiting their complementary strengths. This strategy adopts an R1-like algorithm guided by a carefully designed rule-based reward, effectively enhancing the model's overall perception and reasoning capabilities. Extensive experiments demonstrate the superior generalized grounding capabilities of our model. In multi-image grounding, it outperforms the previous leading MLLMs by 2.0% and 9.7% on MIG-Bench and MC-Bench, respectively. In single-image grounding, it achieves a 9.1% improvement over the base model on ODINW. Furthermore, our model retains strong general multi-image understanding capabilities.
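The abstract describes the rule-based reward only at a high level. As a minimal sketch of what an R1-like, rule-guided reward for multi-target grounding typically looks like, the code below combines a format check on the output template with an IoU-based accuracy term; the <think>/<answer> template, the greedy one-to-one matching, and every name, threshold, and weight here are illustrative assumptions, not details taken from the paper.

```python
# Illustrative R1-style rule-based reward for multi-target grounding.
# All names, the output template, thresholds, and weights are assumptions
# for illustration; the paper does not publish its reward implementation.
import re

def iou(a, b):
    """Intersection-over-union of two boxes in (x1, y1, x2, y2) format."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0

def format_reward(completion):
    """1.0 if the completion follows a <think>...</think><answer>...</answer>
    template (a common R1-style convention, assumed here), else 0.0."""
    pattern = r"<think>.*</think>\s*<answer>.*</answer>"
    return 1.0 if re.fullmatch(pattern, completion.strip(), flags=re.DOTALL) else 0.0

def accuracy_reward(pred_boxes, gt_boxes, thr=0.5):
    """Fraction of ground-truth targets matched one-to-one by a predicted
    box with IoU >= thr, so the reward extends naturally to multiple targets.
    Greedy matching; unmatched extra predictions earn nothing."""
    used, matched = set(), 0
    for gt in gt_boxes:
        best_j, best_iou = None, thr
        for j, pred in enumerate(pred_boxes):
            if j in used:
                continue
            v = iou(pred, gt)
            if v >= best_iou:
                best_j, best_iou = j, v
        if best_j is not None:
            used.add(best_j)
            matched += 1
    return matched / max(len(gt_boxes), 1)

def total_reward(completion, pred_boxes, gt_boxes, w_fmt=0.2, w_acc=0.8):
    """Weighted scalar reward; weights are illustrative."""
    return (w_fmt * format_reward(completion)
            + w_acc * accuracy_reward(pred_boxes, gt_boxes))
```

In an R1-like (e.g., GRPO-style) loop, a reward of this shape would be computed per sampled completion and the group-normalized rewards used as advantages for the policy update.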
Problem

Research questions and friction points this paper is trying to address.

multi-image visual grounding
generalized grounding
multimodal large language models
cross-image reasoning
target localization
Innovation

Methods, ideas, or system contributions that make the work stand out.

Generalized Multi-image Visual Grounding
Multimodal Large Language Models
Hybrid Reinforcement Finetuning
Chain-of-Thought Reasoning
MG-Data-240K
Shurong Zheng
Foundation Model Research Center, Institute of Automation, Chinese Academy of Sciences, Beijing, China
Yousong Zhu
Associate Professor, Institute of Automation, Chinese Academy of Sciences
Multimodal Large Language Models, Self-supervised Learning, Object Detection
Hongyin Zhao
Foundation Model Research Center, Institute of Automation, Chinese Academy of Sciences, Beijing, China
Fan Yang
Professor, Yau Mathematical Sciences Center, Tsinghua University
observational studies, missing data, censoring by death, mediation
Yufei Zhan
Institute of Automation, Chinese Academy of Sciences
Computer Vision, Large Multimodal Models, Grounding and Detection
Ming Tang
Foundation Model Research Center, Institute of Automation, Chinese Academy of Sciences, Beijing, China
Jinqiao Wang
Foundation Model Research Center, Institute of Automation, Chinese Academy of Sciences, Beijing, China