Improving the Reasoning of Multi-Image Grounding in MLLMs via Reinforcement Learning

📅 2025-07-01
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
To address the limited cross-image reasoning and generalization of multimodal large language models (MLLMs) in complex multi-image scenarios, this paper proposes a reinforcement learning (RL)-based post-training framework. The method combines cold-start initialization on synthetic chain-of-thought data, efficient LoRA fine-tuning, rule-guided rejection sampling, and model merging to construct a high-quality multi-image grounding RL dataset. An instruction-aware reward mechanism is further designed to jointly optimize visual referring-expression understanding and multimodal instruction comprehension. On MIG-Bench, the approach achieves a +9.04% absolute improvement; it also yields +4.98% gains on cross-domain benchmarks, and +3.1% and +2.4% improvements on the BLINK and MMIU subsets, respectively. These results demonstrate substantially improved modeling of structural and semantic relationships across multiple images, advancing MLLMs' generalization in multi-image settings.
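The summary describes the reward mechanism only at a high level. A minimal sketch of one plausible rule-based grounding reward is shown below: the model must first pick the correct image, and only then is its predicted box scored by intersection-over-union (IoU). The function names, the gating scheme, and the `(x1, y1, x2, y2)` box format are assumptions for illustration, not the paper's actual implementation.

```python
def iou(box_a, box_b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def grounding_reward(pred_image_idx, pred_box, gt_image_idx, gt_box):
    """Rule-based multi-image grounding reward: zero unless the model
    selects the correct image; otherwise the IoU of its box prediction."""
    if pred_image_idx != gt_image_idx:
        return 0.0
    return iou(pred_box, gt_box)
```

Gating on image selection makes the reward verifiable with simple rules, which is what allows it to drive both rejection sampling and RL without a learned reward model.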

📝 Abstract
Multimodal Large Language Models (MLLMs) have recently excelled at visual grounding in single-image scenarios with textual references. However, their performance degrades when handling real-world applications involving complex multi-image compositions and multimodal instructions, which reveals limitations in cross-image reasoning and generalization. To address these challenges, we adopt a Reinforcement Learning (RL) based post-training strategy to improve the reasoning performance of MLLMs in multi-image grounding tasks. Our approach begins with synthesizing high-quality chain-of-thought (CoT) data for cold-start initialization, followed by supervised fine-tuning (SFT) using low-rank adaptation (LoRA). The cold-start training stage enables the model to identify correct solutions. Subsequently, we perform rejection sampling using the merged SFT model to curate high-quality RL data and leverage rule-based RL to guide the model toward optimal reasoning paths. Extensive experimental results demonstrate the effectiveness of our approach, achieving a +9.04% improvement on MIG-Bench and a +4.98% improvement on several out-of-domain reasoning grounding benchmarks over the SFT baseline. Furthermore, our approach exhibits strong generalization in multi-image perception, with gains of +3.1% and +2.4% over the base model on subsets of the BLINK and MMIU benchmarks, respectively.
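The abstract's choice of LoRA for the SFT stage is about efficiency: a low-rank adapter trains two small matrices A (d×r) and B (r×k) in place of the full d×k weight update. The arithmetic below illustrates the savings; the dimensions and rank are assumed example values (the paper does not state its LoRA configuration here).

```python
def lora_param_count(d, k, r):
    """Trainable parameters for a LoRA adapter on a d-by-k weight:
    A has d*r entries and B has r*k entries."""
    return d * r + r * k


def full_param_count(d, k):
    """Trainable parameters for full fine-tuning of the same weight."""
    return d * k


# Example: a 4096x4096 projection with an assumed rank r=16.
lora = lora_param_count(4096, 4096, 16)    # 131,072 params
full = full_param_count(4096, 4096)        # 16,777,216 params
ratio = lora / full                        # under 1% of the full update
```

At rank 16 the adapter holds under 1% of the parameters of the full weight, which is what makes the SFT and subsequent model-merging stages cheap.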
Problem

Research questions and friction points this paper is trying to address.

Enhance multi-image reasoning in MLLMs via RL
Address cross-image generalization gaps in grounding tasks
Improve multimodal instruction handling with CoT and SFT
Innovation

Methods, ideas, or system contributions that make the work stand out.

Reinforcement Learning post-training for MLLMs
Synthesized chain-of-thought data initialization
Rule-based RL for optimal reasoning paths
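The rejection-sampling step listed above can be sketched as a simple filter: sample several CoT responses from the merged SFT model and keep only those whose rule-based reward clears a threshold, yielding the curated RL dataset. The `generate` and `reward_fn` callables, the sample count, and the threshold are all illustrative assumptions, not the paper's settings.

```python
def rejection_sample(prompt, generate, reward_fn, ground_truth,
                     n_samples=8, threshold=0.5):
    """Sample n_samples responses for a prompt and keep only those whose
    rule-based reward against the ground truth meets the threshold."""
    kept = []
    for _ in range(n_samples):
        response = generate(prompt)
        if reward_fn(response, ground_truth) >= threshold:
            kept.append(response)
    return kept
```

Because the filter reuses the same verifiable reward as the RL stage, the curated data only contains reasoning paths the reward already endorses, which stabilizes subsequent rule-based RL.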
🔎 Similar Papers

Bob Zhang, University of Macau (biometrics, pattern recognition, image processing)
Haoran Li, University of Science and Technology of China
Tao Zhang, Wuhan University
Cilin Yan, Xiaohongshu Inc.
Jiayin Cai, Xiaohongshu Inc.
Xiaolong Jiang, Xiaohongshu Inc.
Yanbin Hao, Hefei University of Technology (video retrieval, video action recognition, hashing, video hyperlinking)