Multimodal Inconsistency Reasoning (MMIR): A New Benchmark for Multimodal Reasoning Models

📅 2025-02-22
📈 Citations: 0
Influential: 0
🤖 AI Summary
This paper addresses a gap in multimodal large language models' (MLLMs) ability to detect and reason about cross-modal semantic inconsistencies in real-world, layout-rich content (e.g., webpages, presentation slides, posters). The authors introduce MMIR, a dedicated benchmark for Multimodal Inconsistency Reasoning, comprising 534 synthetically corrupted samples that span five reasoning-heavy inconsistency categories: Factual Contradiction, Identity Misattribution, Contextual Mismatch, Quantitative Discrepancy, and Temporal/Spatial Incoherence. Six state-of-the-art MLLMs are evaluated zero-shot. Models with dedicated multimodal reasoning capabilities (e.g., o1) substantially outperform their counterparts, while open-source models remain particularly fragile. Models detect inconsistencies confined to a single modality, especially text, far more reliably than cross-modal conflicts in complex layouts, and probing with single-modality prompting methods such as Chain-of-Thought (CoT) and Set-of-Mark (SoM) yields only marginal gains, exposing a fundamental bottleneck in cross-modal reasoning. The MMIR benchmark is publicly released.
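
The single-modality probing mentioned above is easiest to picture as prompt construction. The sketch below shows, under assumed interfaces, how a Chain-of-Thought text probe and a Set-of-Mark image overlay might be built; the function names, prompt wording, and mark format are illustrative assumptions, not the paper's implementation.

```python
# Hypothetical single-modality probes (illustrative; not the paper's implementation).
from PIL import Image, ImageDraw

def cot_text_probe(artifact_text: str) -> str:
    """Chain-of-Thought probe over the text modality alone."""
    return (
        "Here is the text extracted from a layout-rich artifact:\n"
        f"{artifact_text}\n\n"
        "Think step by step: list each claim, check the claims against one "
        "another, and name any element that is inconsistent with the rest."
    )

def som_image_probe(screenshot: Image.Image,
                    boxes: list[tuple[int, int, int, int]]) -> Image.Image:
    """Set-of-Mark probe over the visual modality: overlay numbered marks on
    candidate regions so the model can refer to elements by index."""
    marked = screenshot.copy()
    draw = ImageDraw.Draw(marked)
    for idx, (x0, y0, x1, y1) in enumerate(boxes, start=1):
        draw.rectangle((x0, y0, x1, y1), outline="red", width=3)
        draw.text((x0 + 4, y0 + 4), str(idx), fill="red")
    return marked
```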

📝 Abstract
Existing Multimodal Large Language Models (MLLMs) are predominantly trained and tested on consistent visual-textual inputs, leaving open the question of whether they can handle inconsistencies in real-world, layout-rich content. To bridge this gap, we propose the Multimodal Inconsistency Reasoning (MMIR) benchmark to assess MLLMs' ability to detect and reason about semantic mismatches in artifacts such as webpages, presentation slides, and posters. MMIR comprises 534 challenging samples, each containing synthetically injected errors across five reasoning-heavy categories: Factual Contradiction, Identity Misattribution, Contextual Mismatch, Quantitative Discrepancy, and Temporal/Spatial Incoherence. We evaluate six state-of-the-art MLLMs, showing that models with dedicated multimodal reasoning capabilities, such as o1, substantially outperform their counterparts while open-source models remain particularly vulnerable to inconsistency errors. Detailed error analyses further show that models excel in detecting inconsistencies confined to a single modality, particularly in text, but struggle with cross-modal conflicts and complex layouts. Probing experiments reveal that single-modality prompting, including Chain-of-Thought (CoT) and Set-of-Mark (SoM) methods, yields marginal gains, revealing a key bottleneck in cross-modal reasoning. Our findings highlight the need for advanced multimodal reasoning and point to future research on multimodal inconsistency.
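
The zero-shot evaluation protocol implied by the abstract can be summarized in a short sketch. Everything below, including the sample fields, the query_model hook, and the exact-match scoring rule, is an assumption made for illustration rather than the released MMIR harness.

```python
# Illustrative zero-shot evaluation loop (assumed interface, not the MMIR code).
from collections import defaultdict

CATEGORIES = [
    "Factual Contradiction", "Identity Misattribution", "Contextual Mismatch",
    "Quantitative Discrepancy", "Temporal/Spatial Incoherence",
]

def evaluate(samples, query_model):
    """samples: dicts with 'image', 'category', and the ground-truth
    'inconsistent_element'; query_model: any MLLM callable that returns
    the element it flags as inconsistent."""
    hits, totals = defaultdict(int), defaultdict(int)
    for s in samples:
        pred = query_model(s["image"], "Identify the inconsistent element.")
        totals[s["category"]] += 1
        hits[s["category"]] += int(pred == s["inconsistent_element"])
    # Per-category detection accuracy, e.g. to compare single-modality
    # versus cross-modal inconsistencies.
    return {c: hits[c] / totals[c] for c in CATEGORIES if totals[c]}
```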
Problem

Research questions and friction points this paper is trying to address.

Assess MLLMs' ability to detect semantic inconsistencies in layout-rich artifacts.
Evaluate how MLLMs handle cross-modal conflicts.
Identify bottlenecks in cross-modal reasoning.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multimodal Inconsistency Reasoning (MMIR) benchmark
Synthetic error injection across five inconsistency categories (see the sketch below)
Cross-modal conflict detection probed via single-modality prompting (CoT, SoM)
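
To make the synthetic error injection concrete, here is a minimal sketch of how one corruption type, an Identity Misattribution, might be injected into a parsed artifact. The element schema, caption format, and swap strategy are hypothetical, not the paper's pipeline.

```python
# Hypothetical Identity Misattribution injection (illustrative, not the MMIR pipeline).
import random

def inject_identity_error(elements: list[dict], name_pool: list[str]) -> dict:
    """elements: parsed artifact elements such as
    {'id': 3, 'type': 'caption', 'text': 'Photo of Marie Curie'};
    name_pool: distractor names to swap in.
    Returns ground-truth metadata for the injected error."""
    target = random.choice([e for e in elements if e["type"] == "caption"])
    original = target["text"]
    distractor = random.choice([n for n in name_pool if n not in original])
    # Rewrite only the caption text, leaving the image untouched, so the
    # corrupted sample contains a cross-modal picture/caption mismatch.
    target["text"] = f"Photo of {distractor}"
    return {"element_id": target["id"], "category": "Identity Misattribution",
            "original_text": original}
```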