MM-CoT: A Benchmark for Probing Visual Chain-of-Thought Reasoning in Multimodal Models

📅 2025-12-08
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing multimodal benchmarks emphasize the fluency of chain-of-thought (CoT) generation in vision-language models but neglect whether such reasoning is genuinely grounded in visual evidence and logically coherent. Method: We introduce MM-CoT, the first diagnostic benchmark for multimodal CoT, jointly evaluating visual grounding and logical coherence through a structured event-chain selection task. It incorporates visual consistency verification, causal reasoning, and commonsense judgment, augmented by orthogonal constraints and adversarial perturbations to disentangle grounding failures from logical inconsistencies. Contribution/Results: Experiments reveal that state-of-the-art vision-language models perform substantially below human levels on MM-CoT, exposing a critical gap between generative fluency and reasoning fidelity. Moreover, MM-CoT exhibits low correlation with existing benchmarks, confirming its unique ability to quantify the long-overlooked dimension of reasoning reliability.

📝 Abstract
The ability to perform Chain-of-Thought (CoT) reasoning marks a major milestone for multimodal models (MMs), enabling them to solve complex visual reasoning problems. Yet a critical question remains: is such reasoning genuinely grounded in visual evidence and logically coherent? Existing benchmarks emphasize generation but neglect verification, i.e., the capacity to assess whether a reasoning chain is both visually consistent and logically valid. To fill this gap, we introduce MM-CoT, a diagnostic benchmark specifically designed to probe the visual grounding and logical coherence of CoT reasoning in MMs. Instead of generating free-form explanations, models must select the sole event chain that satisfies two orthogonal constraints: (i) visual consistency, ensuring all steps are anchored in observable evidence, and (ii) logical coherence, ensuring causal and commonsense validity. Adversarial distractors are engineered to violate one of these constraints, exposing distinct reasoning failures. We evaluate leading vision-language models on MM-CoT and find that even the most advanced systems struggle, revealing a sharp discrepancy between generative fluency and true reasoning fidelity. MM-CoT shows low correlation with existing benchmarks, confirming that it measures a unique combination of visual grounding and logical reasoning. This benchmark provides a foundation for developing future models that reason not just plausibly, but faithfully and coherently within the visual world.
Problem

Research questions and friction points this paper is trying to address.

Evaluates visual grounding and logical coherence in multimodal reasoning
Diagnoses reasoning failures via adversarial distractors that violate either visual consistency or logical validity
Measures true reasoning fidelity beyond generative fluency in models
Innovation

Methods, ideas, or system contributions that make the work stand out.

Benchmark tests visual grounding and logical coherence
Models select event chains meeting orthogonal constraints
Adversarial distractors expose distinct reasoning failures
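The selection-plus-diagnosis protocol above can be sketched in code: each item offers one valid event chain and adversarial distractors labeled with the constraint they break, so a wrong choice can be attributed to a grounding or coherence failure. This is a minimal illustrative sketch; the field names and scoring interface are assumptions, not MM-CoT's actual schema.

```python
# Hypothetical sketch of MM-CoT-style scoring. Each item has one correct
# event chain plus distractors, each tagged with the constraint it
# violates ("visual" or "logical"). All names are illustrative.
from collections import Counter

def score(items, predictions):
    """Return overall accuracy and a breakdown of violated constraints."""
    correct = 0
    error_types = Counter()
    for item, pred in zip(items, predictions):
        if pred == item["answer"]:
            correct += 1
        else:
            # Attribute the error to the constraint the chosen distractor breaks.
            error_types[item["violations"][pred]] += 1
    return correct / len(items), dict(error_types)

items = [
    {"answer": "A", "violations": {"B": "visual", "C": "logical"}},
    {"answer": "C", "violations": {"A": "visual", "B": "logical"}},
]
preds = ["A", "B"]
acc, errors = score(items, preds)
# acc == 0.5; errors == {"logical": 1}
```

Separating error types this way is what lets the benchmark disentangle grounding failures from logical inconsistencies rather than reporting a single accuracy number.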