🤖 AI Summary
Multimodal large language models (MLLMs) suffer from hallucination and missed detections in fine-grained visual difference recognition. To address this, we propose MED—the first large-scale, controllable image editing dataset tailored for micro-edit detection, comprising over 50K samples spanning 11 semantic micro-edit categories. We further design a feature-level consistency supervised fine-tuning framework: (i) introducing a novel minimal-edit image pair generation paradigm; (ii) establishing a fine-grained visual editing classification taxonomy; and (iii) incorporating a feature-stability-driven objective to suppress visual representation drift. Evaluated on our newly constructed Micro Edit Detection benchmark, our method significantly outperforms strong baselines—including GPT-4o—achieving higher difference identification accuracy and reduced hallucination. Moreover, it delivers consistent performance gains across general-purpose vision-language tasks, such as image captioning and visual question answering.
📝 Abstract
Multimodal large language models (MLLMs) have achieved strong performance on vision-language tasks but still struggle with fine-grained visual differences, leading to hallucinations or missed semantic shifts. We attribute this to limitations in both training data and learning objectives. To address these issues, we propose a controlled data generation pipeline that produces minimally edited image pairs with semantically aligned captions. Using this pipeline, we construct the Micro Edit Dataset (MED), containing over 50K image-text pairs spanning 11 fine-grained edit categories, including attribute, count, position, and object presence changes. Building on MED, we introduce a supervised fine-tuning (SFT) framework with a feature-level consistency loss that promotes stable visual embeddings under small edits. We evaluate our approach on the Micro Edit Detection benchmark, which includes carefully balanced evaluation pairs designed to test sensitivity to subtle visual variations across the same edit categories. Our method improves difference detection accuracy and reduces hallucinations compared to strong baselines, including GPT-4o. Moreover, it yields consistent gains on standard vision-language tasks such as image captioning and visual question answering. These results demonstrate the effectiveness of combining targeted data and alignment objectives for enhancing fine-grained visual reasoning in MLLMs.