🤖 AI Summary
This work addresses the inability of existing multimodal large language models to recognize real-world traffic accidents and, at the same time, reject adversarial counterfactual scenarios while maintaining contrastive consistency. To tackle this, we introduce CCTVBench, a benchmark that pairs real accident videos with minimally altered counterfactual videos generated by world models and organizes them into structured video-question quadruples that enforce a single consistent decision pattern. We propose the first contrastive consistency evaluation framework tailored for traffic video question answering, which enables fine-grained failure diagnosis, and we present Contrastive Temporal Consistency Decoding (C-TCD), a decoding method applied at inference time. Experiments reveal a significant gap between standard QA performance and contrastive consistency in current models, while C-TCD improves both instance-level answer accuracy and contrastive consistency.
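To make the gap between per-instance accuracy and quadruple-level consistency concrete, here is a minimal sketch. It assumes that a quadruple consists of two videos (real and counterfactual) crossed with two questions, and that a quadruple counts as consistent only when all four answers are correct. The key names (`real`, `cf`, `q_a`, `q_b`) and the toy results are hypothetical and are not taken from the benchmark.

```python
from typing import Dict, List, Tuple

# Assumed layout: (video, question) -> whether the model answered correctly.
Quadruple = Dict[Tuple[str, str], bool]


def instance_accuracy(quads: List[Quadruple]) -> float:
    """Fraction of individual (video, question) items answered correctly."""
    total = sum(len(q) for q in quads)
    correct = sum(sum(q.values()) for q in quads)
    return correct / total


def quadruple_consistency(quads: List[Quadruple]) -> float:
    """Fraction of quadruples in which every item is answered correctly."""
    return sum(all(q.values()) for q in quads) / len(quads)


# Toy results (made up for illustration): two quadruples each miss one item.
quads: List[Quadruple] = [
    {("real", "q_a"): True, ("real", "q_b"): True,
     ("cf", "q_a"): True, ("cf", "q_b"): False},
    {("real", "q_a"): True, ("real", "q_b"): True,
     ("cf", "q_a"): True, ("cf", "q_b"): True},
    {("real", "q_a"): False, ("real", "q_b"): True,
     ("cf", "q_a"): True, ("cf", "q_b"): True},
]

print(f"instance accuracy:     {instance_accuracy(quads):.3f}")      # 0.833
print(f"quadruple consistency: {quadruple_consistency(quads):.3f}")  # 0.333
```

In this toy case, 10 of 12 items are correct, yet only 1 of 3 quadruples is fully consistent, which is the kind of discrepancy an all-or-nothing quadruple metric exposes.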
📝 Abstract
Safety-critical traffic reasoning requires contrastive consistency: models must detect true hazards when an accident occurs and reliably reject plausible-but-false hypotheses under near-identical counterfactual scenes. We present CCTVBench, a Contrastive Consistency Traffic VideoQA Benchmark built on paired real accident videos and world-model-generated counterfactual counterparts, together with minimally different, mutually exclusive hypothesis questions. CCTVBench enforces a single structured decision pattern over each video-question quadruple and provides actionable diagnostics that decompose failures into positive omission, positive swap, negative hallucination, and mutual-exclusivity violation, while separating video-level from question-level consistency. Experiments across open-source and proprietary video LLMs reveal a large and persistent gap between standard per-instance QA metrics and quadruple-level contrastive consistency, with unreliable none-of-the-above rejection as a key bottleneck. Finally, we introduce C-TCD, a contrastive decoding approach that uses a semantically exclusive counterpart video as the contrast input at inference time, improving both instance-level QA and contrastive consistency.
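The abstract does not give the C-TCD scoring rule, so the following is only a sketch of the generic contrastive-decoding idea it builds on: score each answer option by its log-probability under the target video, boosted by how much that differs from its log-probability under a contrast input (here, the counterpart video). The formula `(1 + alpha) * log p_target - alpha * log p_contrast`, the `alpha` parameter, and the example logits are assumptions for illustration, not the paper's formulation.

```python
import math
from typing import List


def log_softmax(logits: List[float]) -> List[float]:
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def contrastive_scores(target_logits: List[float],
                       contrast_logits: List[float],
                       alpha: float = 1.0) -> List[float]:
    """Generic contrastive score; alpha = 0 reduces to plain log-probabilities."""
    lt = log_softmax(target_logits)
    lc = log_softmax(contrast_logits)
    return [(1 + alpha) * t - alpha * c for t, c in zip(lt, lc)]


def argmax(scores: List[float]) -> int:
    return max(range(len(scores)), key=scores.__getitem__)


# Hypothetical logits over three answer options for one question.
options = ["A", "B", "none of the above"]
target_logits = [2.0, 1.9, 0.5]    # model conditioned on the target video
contrast_logits = [2.0, 0.2, 1.5]  # model conditioned on the counterpart video

greedy_choice = argmax(log_softmax(target_logits))
contrastive_choice = argmax(contrastive_scores(target_logits, contrast_logits))

print("greedy:", options[greedy_choice])
print("contrastive:", options[contrastive_choice])
```

In this toy example, option A scores highly under both videos, so the contrast term penalizes it as video-independent, and option B, which is supported mainly by the target video, is selected instead.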