🤖 AI Summary
Existing benchmarks assume cross-modal input consistency, making them inadequate for evaluating models' ability to detect image-text contradictions, a critical limitation for real-world reliability. To address this, we introduce CLASH, the first fine-grained benchmark for cross-modal contradiction detection. Built upon COCO images, CLASH pairs each image with a caption containing an object-level or attribute-level contradiction and evaluates detection through multiple-choice and open-ended questions. It leverages controllable text generation, automated filtering, and rigorous human validation to construct a large-scale training set and a high-quality diagnostic subset. Evaluations on CLASH reveal, for the first time, systematic modality bias and category-specific fragility in mainstream multimodal models. Extensive experiments further demonstrate that fine-tuning on CLASH significantly improves contradiction detection, establishing a new standard and an effective pathway for advancing cross-modal consistency modeling.
📝 Abstract
Contradictory multimodal inputs are common in real-world settings, yet existing benchmarks typically assume input consistency and fail to evaluate cross-modal contradiction detection, a fundamental capability for preventing hallucinations and ensuring reliability. We introduce CLASH, a novel benchmark for multimodal contradiction detection that pairs COCO images with captions containing controlled object-level or attribute-level contradictions. Each sample includes targeted questions evaluated in both multiple-choice and open-ended formats. The benchmark provides an extensive fine-tuning set filtered through automated quality checks, alongside a smaller human-verified diagnostic set. Our analysis of state-of-the-art models reveals substantial limitations in recognizing cross-modal conflicts, exposing systematic modality biases and category-specific weaknesses. Furthermore, we empirically demonstrate that targeted fine-tuning on CLASH substantially enhances conflict detection capabilities.
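The abstract does not specify the released data schema. As a rough, hypothetical sketch only (all field names, values, and the scoring function below are assumptions for illustration, not the paper's actual format), a CLASH-style multiple-choice item and its evaluation might look like this:

```python
# Hypothetical illustration of a CLASH-style sample; the field names and
# values are assumptions, not the benchmark's released schema.
sample = {
    "image": "coco/val2017/000000397133.jpg",              # COCO image the caption is paired with
    "caption": "A man rides a red bicycle past a cafe.",   # caption with an injected contradiction
    "contradiction_type": "object",                        # "object" or "attribute"
    "question": "Does the caption contradict the image? If so, what conflicts?",
    "choices": [                                            # multiple-choice evaluation format
        "No contradiction",
        "The object type conflicts with the image",
        "The object color conflicts with the image",
        "The scene location conflicts with the image",
    ],
    "answer_index": 1,
}

def score_multiple_choice(prediction_index: int, item: dict) -> bool:
    """Exact-match scoring for the multiple-choice format (assumed metric)."""
    return prediction_index == item["answer_index"]

print(score_multiple_choice(1, sample))  # True
```

The open-ended format would instead elicit a free-form description of the conflict, which typically requires model- or rubric-based judging rather than exact matching.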