CGC: Compositional Grounded Contrast for Fine-Grained Multi-Image Understanding

Date: 2026-04-24
Citations: 0
Influential: 0

AI Summary
This work addresses the challenges that multimodal large language models face in fine-grained multi-image understanding: spatial hallucination, attention leakage, and failures in object constancy. Existing methods additionally rely on costly human annotations or large-scale chain-of-thought data. To overcome these limitations, the authors propose the Compositional Grounded Contrast (CGC) framework, which uses low-cost single-image grounding annotations to construct multi-image training samples. CGC introduces semantically decoupled distractors through Inter-Image Contrast and correlated cross-view samples through Intra-Image Contrast, and adds a Rule-Based Spatial Reward. Operating within a Think-before-Grounding paradigm, the framework jointly optimizes source-image attribution, spatial alignment, and structured output generation, all without requiring dense manual annotations. The method achieves state-of-the-art performance on MIG-Bench and VLM2-Bench and yields consistent gains over the Qwen3-VL-8B base model on MathVista, MuirBench, MMStar, MMMU, and BLINK.
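
As a rough illustration of how single-image grounding annotations might be composed into a multi-image training sample, the sketch below builds one inter-image instance. It is not the authors' pipeline: the `GroundedImage` record, the `compose_inter_image_sample` helper, and the rule that a distractor is any image whose category differs from the target's are all assumptions made for this example.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class GroundedImage:
    """Hypothetical single-image grounding annotation."""
    image_id: str
    phrase: str      # referring expression for the annotated object
    category: str    # coarse semantic label, used here to pick distractors
    bbox: tuple      # (x1, y1, x2, y2) in pixels


def compose_inter_image_sample(target, pool, num_distractors, rng):
    """Mix one target image with distractors from other categories.

    Assumption: "semantically decoupled" is approximated by requiring
    that a distractor's category differs from the target's.
    """
    candidates = [
        g for g in pool
        if g.image_id != target.image_id and g.category != target.category
    ]
    if len(candidates) < num_distractors:
        raise ValueError("not enough semantically distinct distractors")

    images = rng.sample(candidates, num_distractors) + [target]
    rng.shuffle(images)  # the target position must not be predictable
    target_index = next(
        i for i, g in enumerate(images) if g.image_id == target.image_id
    )
    return {
        "image_ids": [g.image_id for g in images],
        "query": target.phrase,
        # Supervision: which image holds the object, and where it is.
        "answer": {"image_index": target_index, "bbox": target.bbox},
    }


pool = [
    GroundedImage("img_a", "the red mug", "mug", (10, 20, 110, 140)),
    GroundedImage("img_b", "the black dog", "dog", (5, 5, 200, 180)),
    GroundedImage("img_c", "the parked bicycle", "bicycle", (30, 40, 300, 260)),
    GroundedImage("img_d", "a white mug", "mug", (50, 60, 120, 150)),
]
sample = compose_inter_image_sample(pool[0], pool, 2, random.Random(0))
```

The label for the multi-image sample comes entirely from the existing single-image box plus the index assigned during shuffling, which is what would keep the construction cheap under these assumptions.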

๐Ÿ“ Abstract
Although Multimodal Large Language Models (MLLMs) have advanced rapidly, they still face notable challenges in fine-grained multi-image understanding, often exhibiting spatial hallucination, attention leakage, and failures in object constancy. In addition, existing approaches typically rely on expensive human annotations or large-scale chain-of-thought (CoT) data generation. We propose Compositional Grounded Contrast (CGC), a low-cost, complete framework for improving the fine-grained multi-image understanding of MLLMs. Built on existing single-image grounding annotations, CGC constructs compositional multi-image training instances through Inter-Image Contrast and Intra-Image Contrast, which introduce semantically decoupled distractor contexts for cross-image discrimination and correlated cross-view samples for object constancy, respectively. CGC further introduces a Rule-Based Spatial Reward within the GRPO framework to improve source-image attribution, spatial alignment, and structured output validity under a Think-before-Grounding paradigm. Experiments show that CGC achieves state-of-the-art results on fine-grained multi-image benchmarks, including MIG-Bench and VLM2-Bench. The learned multi-image understanding capability also transfers to broader multimodal understanding and reasoning tasks, yielding consistent gains over the Qwen3-VL-8B base model on MathVista (+2.90), MuirBench (+2.88), MMStar (+1.93), MMMU (+1.77), and BLINK (+1.69).
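
The abstract does not give the reward formula, so the following is only a minimal sketch of what a rule-based spatial reward covering the three named criteria could look like. The `<think>`/`<answer>` output template, the JSON fields `image_index` and `bbox`, and the weights 0.2 / 0.3 / 0.5 are all hypothetical choices for illustration. The final helper shows the standard group-relative normalization used in GRPO, applied to these rewards.

```python
import json
import re
from statistics import mean, pstdev

# Assumed output template: reasoning first, then a JSON answer.
ANSWER_RE = re.compile(r"<think>.*?</think>\s*<answer>(.*?)</answer>", re.DOTALL)


def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    area_a = max(0.0, a[2] - a[0]) * max(0.0, a[3] - a[1])
    area_b = max(0.0, b[2] - b[0]) * max(0.0, b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def spatial_reward(response, gt_index, gt_bbox):
    """Score one response on format, source image, and box overlap."""
    match = ANSWER_RE.search(response)
    if match is None:
        return 0.0  # structured output invalid
    try:
        answer = json.loads(match.group(1))
        index = answer["image_index"]
        box = [float(v) for v in answer["bbox"]]
    except (ValueError, KeyError, TypeError):
        return 0.0
    if len(box) != 4:
        return 0.0

    reward = 0.2  # hypothetical weight for a well-formed answer
    if index == gt_index:
        # Box overlap only counts when the right image was chosen.
        reward += 0.3 + 0.5 * iou(box, gt_bbox)
    return reward


def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style advantage: standardize rewards within one sampled group."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


good = '<think>mug is in image 1</think><answer>{"image_index": 1, "bbox": [10, 20, 110, 140]}</answer>'
wrong_image = '<think>guess</think><answer>{"image_index": 0, "bbox": [10, 20, 110, 140]}</answer>'
malformed = "the mug is in the second image"

rewards = [spatial_reward(r, 1, (10, 20, 110, 140)) for r in (good, wrong_image, malformed)]
advantages = group_relative_advantages(rewards)
```

Gating the overlap term on the image index is one way to penalize attributing a correct-looking box to the wrong image; whether CGC combines the terms this way is not stated in the source.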
Problem

Research questions and friction points this paper is trying to address.

fine-grained multi-image understanding
spatial hallucination
attention leakage
object constancy
multimodal large language models
Innovation

Methods, ideas, or system contributions that make the work stand out.

Compositional Grounded Contrast
Multi-Image Understanding
Inter-Image Contrast
Intra-Image Contrast
Spatial Reward