🤖 AI Summary
Existing benchmarks assume cross-modal input consistency, making them inadequate for evaluating models' ability to detect image-text contradictions, a critical limitation for real-world reliability. To address this, we introduce CLASH, the first fine-grained benchmark for cross-modal contradiction detection. Built upon COCO images, CLASH pairs each image with a caption containing an object-level or attribute-level contradiction and evaluates detection through multiple-choice and open-ended questions. It leverages controllable text generation, automated filtering, and rigorous human validation to construct a large-scale training set and a high-quality diagnostic subset. Evaluations on CLASH reveal, for the first time, systematic modality bias and category-specific fragility in mainstream multimodal models. Extensive experiments further demonstrate that fine-tuning on CLASH significantly improves contradiction detection, establishing a new standard and an effective pathway for advancing cross-modal consistency modeling.
📝 Abstract
Contradictory multimodal inputs are common in real-world settings, yet existing benchmarks typically assume input consistency and fail to evaluate cross-modal contradiction detection, a fundamental capability for preventing hallucinations and ensuring reliability. We introduce CLASH, a novel benchmark for multimodal contradiction detection that pairs COCO images with captions containing controlled object-level or attribute-level contradictions. Each sample includes targeted questions evaluated in both multiple-choice and open-ended formats. The benchmark provides an extensive fine-tuning set filtered through automated quality checks, alongside a smaller human-verified diagnostic set. Our analysis of state-of-the-art models reveals substantial limitations in recognizing cross-modal conflicts, exposing systematic modality biases and category-specific weaknesses. Furthermore, we empirically demonstrate that targeted fine-tuning on CLASH substantially enhances conflict detection capabilities.
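The abstract does not specify the released data schema. As a rough, hypothetical sketch only (all field names, values, and the scoring function below are assumptions for illustration, not the paper's actual format), a CLASH-style multiple-choice item and its evaluation might look like this:

```python
# Hypothetical illustration of a CLASH-style sample; the field names and
# values are assumptions, not the benchmark's released schema.
sample = {
    "image": "coco/val2017/000000397133.jpg",              # COCO image the caption is paired with
    "caption": "A man rides a red bicycle past a cafe.",   # caption with an injected contradiction
    "contradiction_type": "object",                        # "object" or "attribute"
    "question": "Does the caption contradict the image? If so, what conflicts?",
    "choices": [                                            # multiple-choice evaluation format
        "No contradiction",
        "The object type conflicts with the image",
        "The object color conflicts with the image",
        "The scene location conflicts with the image",
    ],
    "answer_index": 1,
}

def score_multiple_choice(prediction_index: int, item: dict) -> bool:
    """Exact-match scoring for the multiple-choice format (assumed metric)."""
    return prediction_index == item["answer_index"]

print(score_multiple_choice(1, sample))  # True
```

The open-ended format would instead elicit a free-form description of the conflict, which typically requires model- or rubric-based judging rather than exact matching.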