OmniDiff: A Comprehensive Benchmark for Fine-grained Image Difference Captioning

📅 2025-03-14
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing Image Difference Captioning (IDC) datasets suffer from limited scope (narrow scene coverage) and shallow granularity (coarse-grained descriptions), hindering fine-grained understanding in complex, dynamic environments. To address this, we introduce OmniDiff, a fine-grained IDC benchmark spanning both real-world and 3D-synthetic scenes, comprising 324 diverse scenarios, 12 change categories, and human-annotated captions averaging 60 words each. Methodologically, we propose a plug-and-play Multi-scale Differential Perception (MDP) module and build M$^3$Diff, a multimodal large language model that integrates the module while preserving the base model's generalization ability. Trained with OmniDiff, our approach achieves state-of-the-art performance across five benchmarks (Spot-the-Diff, IEdit, CLEVR-Change, CLEVR-DC, and OmniDiff), with significant improvements in cross-scenario difference recognition accuracy. The dataset, code, and models will be publicly released.

📝 Abstract
Image Difference Captioning (IDC) aims to generate natural language descriptions of subtle differences between image pairs, requiring both precise visual change localization and coherent semantic expression. Despite recent advancements, existing datasets often lack breadth and depth, limiting their applicability in complex and dynamic environments: (1) from a breadth perspective, current datasets are constrained to limited variations of objects in specific scenes, and (2) from a depth perspective, prior benchmarks often provide overly simplistic descriptions. To address these challenges, we introduce OmniDiff, a comprehensive dataset comprising 324 diverse scenarios, spanning real-world complex environments and 3D synthetic settings, with fine-grained human annotations averaging 60 words in length and covering 12 distinct change types. Building on this foundation, we propose M$^3$Diff, a MultiModal large language model enhanced by a plug-and-play Multi-scale Differential Perception (MDP) module. This module improves the model's ability to accurately identify and describe inter-image differences while maintaining the foundational model's generalization capabilities. With the addition of the OmniDiff dataset, M$^3$Diff achieves state-of-the-art performance across multiple benchmarks, including Spot-the-Diff, IEdit, CLEVR-Change, CLEVR-DC, and OmniDiff, demonstrating significant improvements in cross-scenario difference recognition accuracy compared to existing methods. The dataset, code, and models will be made publicly available to support further research.
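The abstract describes the MDP module only at a high level, so the following is a minimal sketch, assuming the module pools the paired "before"/"after" feature maps at several spatial scales, subtracts them, and fuses the per-scale differences into a single vector for the language model. The class name `MultiScaleDiffPerception` and its parameters (`dim`, `scales`) are hypothetical and not taken from the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class MultiScaleDiffPerception(nn.Module):
    """Hypothetical sketch of a multi-scale differential-perception block.

    Assumption: the "before"/"after" feature maps come from a shared vision
    encoder; the block pools them at several spatial scales, subtracts them,
    and fuses the per-scale differences into one vector for the language model.
    """

    def __init__(self, dim: int, scales=(1, 2, 4)):
        super().__init__()
        self.scales = scales
        # One lightweight projection per scale for the pooled difference features.
        self.proj = nn.ModuleList([nn.Linear(dim, dim) for _ in scales])
        # Final fusion back to the model width.
        self.fuse = nn.Linear(dim * len(scales), dim)

    def forward(self, feat_a: torch.Tensor, feat_b: torch.Tensor) -> torch.Tensor:
        # feat_a, feat_b: (batch, dim, height, width) feature maps of the image pair.
        per_scale = []
        for scale, proj in zip(self.scales, self.proj):
            # Pool both maps to `scale` x `scale`, average spatially, then subtract.
            a = F.adaptive_avg_pool2d(feat_a, scale).flatten(2).mean(-1)
            b = F.adaptive_avg_pool2d(feat_b, scale).flatten(2).mean(-1)
            per_scale.append(proj(a - b))
        # Concatenate per-scale difference features and fuse to the model width.
        return self.fuse(torch.cat(per_scale, dim=-1))
```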
Problem

Research questions and friction points this paper is trying to address.

Lack of diverse datasets for image difference captioning.
Existing benchmarks provide overly simplistic descriptions.
Need for accurate and detailed inter-image difference recognition.
Innovation

Methods, ideas, or system contributions that make the work stand out.

OmniDiff dataset with 324 diverse scenarios spanning real-world and 3D synthetic scenes.
M$^3$Diff model with a plug-and-play Multi-scale Differential Perception (MDP) module (see the usage sketch after this list).
State-of-the-art performance across multiple benchmarks.
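To illustrate the plug-and-play idea from the bullet above, here is a hedged usage sketch that wires the hypothetical `MultiScaleDiffPerception` block from the earlier snippet between a stand-in vision encoder and the point where a language model would consume the difference features. The encoder, input sizes, and variable names are assumptions for illustration only, not the paper's actual architecture.

```python
import torch

# Stand-in vision encoder: a single patchify convolution (not the paper's encoder).
encoder = torch.nn.Conv2d(3, 256, kernel_size=16, stride=16)
mdp = MultiScaleDiffPerception(dim=256)  # hypothetical block defined above

before = torch.randn(1, 3, 224, 224)  # "before" image
after = torch.randn(1, 3, 224, 224)   # "after" image

with torch.no_grad():
    # Encode both images with the same frozen encoder, then compute difference features.
    diff_features = mdp(encoder(before), encoder(after))

# (1, 256) difference vector, which a multimodal LLM could consume as extra conditioning.
print(diff_features.shape)
```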
Authors

Yuan Liu
School of Artificial Intelligence, Beijing Normal University

Saihui Hou
Beijing Normal University
Research areas: Deep Learning, Computer Vision, Multimodal Large Language Models

Saijie Hou
School of Artificial Intelligence, Beijing University of Posts and Telecommunications

Jiabao Du
School of Artificial Intelligence, Beijing Normal University

Shibei Meng
School of Artificial Intelligence, Beijing Normal University

Yongzhen Huang
School of Artificial Intelligence, Beijing Normal University
Research areas: Computer Vision, Pattern Recognition, Deep Learning