DiG: Differential Grounding for Enhancing Fine-Grained Perception in Multimodal Large Language Model

๐Ÿ“… 2025-12-14
๐Ÿ“ˆ Citations: 0
โœจ Influential: 0
๐Ÿ“„ PDF
๐Ÿค– AI Summary
Multimodal large language models (MLLMs) exhibit significant limitations in fine-grained visual perception and precise spatial reasoning, particularly in identifying and localizing all differences between highly similar image pairs. Method: We propose the Difference Grounding (DiG) proxy task frameworkโ€”a novel training paradigm that operates without prior assumptions about the number of differences, enabling systematic improvement in MLLMsโ€™ ability to detect and precisely localize all disparities. Contributions/Results: (1) The first difference localization paradigm free from pre-specified difference counts; (2) A controllable 3D-rendering data generation pipeline enabling explicit control over difference types, quantities, and spatial distributions; (3) A progressive curriculum learning strategy based on difference complexity. Through multi-stage fine-tuning and cross-task transfer, DiG achieves an average 12.6% improvement in fine-grained localization accuracy on RefCOCO benchmarks and general vision benchmarks, with markedly enhanced generalization.

Technology Category

Application Category

๐Ÿ“ Abstract
Multimodal Large Language Models have achieved impressive performance on a variety of vision-language tasks, yet their fine-grained visual perception and precise spatial reasoning remain limited. In this work, we introduce DiG (Differential Grounding), a novel proxy task framework where MLLMs learn fine-grained perception by identifying and localizing all differences between similar image pairs without prior knowledge of their number. To support scalable training, we develop an automated 3D rendering-based data generation pipeline that produces high-quality paired images with fully controllable discrepancies. To address the sparsity of difference signals, we further employ curriculum learning that progressively increases complexity from single to multiple differences, enabling stable optimization. Extensive experiments demonstrate that DiG significantly improves model performance across a variety of visual perception benchmarks and that the learned fine-grained perception skills transfer effectively to standard downstream tasks, including RefCOCO, RefCOCO+, RefCOCOg, and general multimodal perception benchmarks. Our results highlight differential grounding as a scalable and robust approach for advancing fine-grained visual reasoning in MLLMs.
Problem

Research questions and friction points this paper is trying to address.

Enhances fine-grained perception in multimodal large language models
Improves spatial reasoning by identifying differences in image pairs
Transfers learned skills to downstream visual perception benchmarks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Differential grounding framework for fine-grained perception
Automated 3D rendering pipeline for scalable data generation
Curriculum learning from single to multiple differences
๐Ÿ”Ž Similar Papers
No similar papers found.
Z
Zhou Tao
University of Science and Technology of China, State Key Laboratory of Cognitive Intelligence
Shida Wang
Shida Wang
National University of Singapore
Sequence ModellingLarge Language Model
Y
Yongxiang Hua
University of Science and Technology of China, State Key Laboratory of Cognitive Intelligence
H
Haoyu Cao
University of Science and Technology of China, State Key Laboratory of Cognitive Intelligence
L
Linli Xu
University of Science and Technology of China, State Key Laboratory of Cognitive Intelligence