🤖 AI Summary
Existing video understanding benchmarks lack a systematic evaluation of counterfactual reasoning, which hinders assessment of the dynamic logical reasoning capabilities of multimodal large language models (MLLMs) along the abstract–concrete and perceptual–cognitive dimensions.
Method: We introduce COVER, the first multidimensional benchmark for video counterfactual reasoning, featuring coverage-based counterfactual modeling, subproblem-driven reasoning attribution, multi-model performance decoupling analysis, and a unified video–language joint reasoning protocol. COVER employs structured subproblem decomposition to enable fine-grained characterization of reasoning capabilities.
Contribution/Results: COVER is the first benchmark to uncover a strong correlation between structured reasoning and video understanding robustness. Extensive experiments on leading commercial and open-source MLLMs demonstrate that COVER can localize interpretable performance gaps. COVER establishes a new evaluation standard for dynamic-scene logical reasoning, advancing rigorous, transparent assessment of MLLM reasoning fidelity in complex spatiotemporal contexts.
📝 Abstract
Counterfactual reasoning is crucial for robust video understanding but remains underexplored in existing multimodal benchmarks. In this paper, we introduce **COVER** (**CO**unterfactual **V**id**E**o **R**easoning), a multidimensional multimodal benchmark that systematically evaluates multimodal large language models (MLLMs) across the abstract-concrete and perception-cognition dimensions. Beyond prior multimodal benchmarks, COVER decomposes complex queries into structured sub-questions, enabling fine-grained reasoning analysis. Experiments on commercial and open-source models reveal a strong correlation between sub-question accuracy and counterfactual reasoning performance, highlighting the role of structured inference in video understanding. Furthermore, our results suggest a key insight: enhancing the reasoning capability of models is essential for improving the robustness of video understanding. COVER establishes a new standard for assessing MLLMs' logical reasoning abilities in dynamic environments.
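To make the sub-question-driven analysis concrete, the following is a minimal sketch of how one might correlate per-item sub-question accuracy with main-question correctness across a model's results. The item schema (`main_correct`, `sub_answers`) and the toy data are assumptions for illustration, not the benchmark's actual format or findings.

```python
# Hypothetical sketch: correlate sub-question accuracy with main-question
# correctness. Field names and toy data are illustrative assumptions only.

def subquestion_accuracy(item):
    """Fraction of an item's sub-questions answered correctly."""
    subs = item["sub_answers"]  # list of (predicted, gold) pairs
    return sum(1 for pred, gold in subs if pred == gold) / len(subs)

def pearson(xs, ys):
    """Plain Pearson correlation coefficient, no external dependencies."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# Toy results for one model: each item pairs a main counterfactual
# question with its decomposed sub-questions.
results = [
    {"main_correct": 1, "sub_answers": [("a", "a"), ("b", "b"), ("c", "c")]},
    {"main_correct": 1, "sub_answers": [("a", "a"), ("b", "b"), ("d", "c")]},
    {"main_correct": 0, "sub_answers": [("a", "b"), ("c", "b"), ("d", "c")]},
    {"main_correct": 0, "sub_answers": [("a", "a"), ("x", "b"), ("d", "c")]},
]

sub_acc = [subquestion_accuracy(r) for r in results]
main = [float(r["main_correct"]) for r in results]
print(round(pearson(sub_acc, main), 3))  # → 0.894
```

A positive coefficient here would mirror the abstract's observation that models answering the structured sub-questions well also tend to get the counterfactual main question right.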