DiG: Differential Grounding for Enhancing Fine-Grained Perception in Multimodal Large Language Model

📅 2025-12-14

📈 Citations: 0

✨ Influential: 0

career value

205K/year

🤖 AI Summary

Multimodal large language models (MLLMs) exhibit significant limitations in fine-grained visual perception and precise spatial reasoning, particularly in identifying and localizing all differences between highly similar image pairs. Method: We propose the Difference Grounding (DiG) proxy task framework—a novel training paradigm that operates without prior assumptions about the number of differences, enabling systematic improvement in MLLMs’ ability to detect and precisely localize all disparities. Contributions/Results: (1) The first difference localization paradigm free from pre-specified difference counts; (2) A controllable 3D-rendering data generation pipeline enabling explicit control over difference types, quantities, and spatial distributions; (3) A progressive curriculum learning strategy based on difference complexity. Through multi-stage fine-tuning and cross-task transfer, DiG achieves an average 12.6% improvement in fine-grained localization accuracy on RefCOCO benchmarks and general vision benchmarks, with markedly enhanced generalization.

Technology Category

Application Category

📝 Abstract

Multimodal Large Language Models have achieved impressive performance on a variety of vision-language tasks, yet their fine-grained visual perception and precise spatial reasoning remain limited. In this work, we introduce DiG (Differential Grounding), a novel proxy task framework where MLLMs learn fine-grained perception by identifying and localizing all differences between similar image pairs without prior knowledge of their number. To support scalable training, we develop an automated 3D rendering-based data generation pipeline that produces high-quality paired images with fully controllable discrepancies. To address the sparsity of difference signals, we further employ curriculum learning that progressively increases complexity from single to multiple differences, enabling stable optimization. Extensive experiments demonstrate that DiG significantly improves model performance across a variety of visual perception benchmarks and that the learned fine-grained perception skills transfer effectively to standard downstream tasks, including RefCOCO, RefCOCO+, RefCOCOg, and general multimodal perception benchmarks. Our results highlight differential grounding as a scalable and robust approach for advancing fine-grained visual reasoning in MLLMs.

Problem

Research questions and friction points this paper is trying to address.

Enhances fine-grained perception in multimodal large language models

Improves spatial reasoning by identifying differences in image pairs

Transfers learned skills to downstream visual perception benchmarks

Innovation

Methods, ideas, or system contributions that make the work stand out.

Differential grounding framework for fine-grained perception

Automated 3D rendering pipeline for scalable data generation

Curriculum learning from single to multiple differences

🔎 Similar Papers

Emerging Pixel Grounding in Large Multimodal Models Without Grounding Supervision