Reasoning is All You Need for Video Generalization: A Counterfactual Benchmark with Sub-question Evaluation

📅 2025-03-12
📈 Citations: 0
Influential: 0
🤖 AI Summary
Problem: Existing video understanding benchmarks lack systematic evaluation of counterfactual reasoning, hindering assessment of multimodal large language models’ (MLLMs) dynamic logical reasoning capabilities across abstract–concrete and perceptual–cognitive dimensions. Method: We introduce COVER, the first multidimensional benchmark for video counterfactual reasoning, featuring coverage-based counterfactual modeling, sub-question-driven reasoning attribution, multi-model performance decoupling analysis, and a unified video–language joint reasoning protocol. COVER employs structured sub-question decomposition to enable fine-grained characterization of reasoning capabilities. Contribution/Results: It is the first to uncover a strong correlation between structured reasoning and video understanding robustness. Extensive experiments on leading commercial and open-source MLLMs demonstrate COVER’s capacity to localize interpretable performance gaps. COVER establishes a new evaluation standard for dynamic-scene logical reasoning, advancing rigorous, transparent assessment of MLLM reasoning fidelity in complex spatiotemporal contexts.

📝 Abstract
Counterfactual reasoning is crucial for robust video understanding but remains underexplored in existing multimodal benchmarks. In this paper, we introduce COVER (COunterfactual VidEo Reasoning), a multidimensional multimodal benchmark that systematically evaluates MLLMs across the abstract-concrete and perception-cognition dimensions. Beyond prior multimodal benchmarks, COVER decomposes complex queries into structured sub-questions, enabling fine-grained reasoning analysis. Experiments on commercial and open-source models reveal a strong correlation between sub-question accuracy and counterfactual reasoning performance, highlighting the role of structured inference in video understanding. Furthermore, our results suggest a key insight: enhancing the reasoning capability of models is essential for improving the robustness of video understanding. COVER establishes a new standard for assessing MLLMs' logical reasoning abilities in dynamic environments.
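
As a rough illustration of the sub-question evaluation described in the abstract, the sketch below scores each model on its sub-questions and on its counterfactual questions, then correlates the two across models. The `Item` schema, the function names, and the use of Pearson correlation are all assumptions made for illustration; the abstract does not specify COVER's data format or the statistic the authors used.

```python
from dataclasses import dataclass
from math import sqrt
from typing import List


@dataclass
class Item:
    """One benchmark item (hypothetical schema, not COVER's actual format)."""
    counterfactual_correct: bool      # was the counterfactual question answered correctly?
    sub_question_correct: List[bool]  # one flag per structured sub-question


def sub_question_accuracy(items: List[Item]) -> float:
    """Fraction of all sub-questions answered correctly, pooled over items."""
    total = sum(len(it.sub_question_correct) for it in items)
    right = sum(sum(it.sub_question_correct) for it in items)
    return right / total


def counterfactual_accuracy(items: List[Item]) -> float:
    """Fraction of counterfactual questions answered correctly."""
    return sum(it.counterfactual_correct for it in items) / len(items)


def pearson(xs: List[float], ys: List[float]) -> float:
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Toy per-model results, invented purely to exercise the functions above.
toy_results = {
    "model_a": [Item(True, [True, True]), Item(True, [True, False])],
    "model_b": [Item(True, [True, False]), Item(False, [False, False])],
    "model_c": [Item(False, [False, False]), Item(False, [True, False])],
}

sub_acc = [sub_question_accuracy(items) for items in toy_results.values()]
cf_acc = [counterfactual_accuracy(items) for items in toy_results.values()]
correlation = pearson(sub_acc, cf_acc)
```

With real per-model scores in place of `toy_results`, a `correlation` value close to 1 would correspond to the relationship the abstract reports. The toy numbers here carry no empirical meaning.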
Problem

Research questions and friction points this paper is trying to address.

Counterfactual reasoning in video understanding is underexplored.
COVER benchmark evaluates MLLMs' reasoning across multiple dimensions.
Sub-question accuracy correlates strongly with counterfactual reasoning performance.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces COVER for counterfactual video reasoning
Decomposes queries into structured sub-questions
Suggests that enhancing model reasoning is essential for robust video understanding
Qiji Zhou
Westlake University
Natural Language Processing, Computational Linguistics, Logic, Multimodal Models
Yifan Gong
College of Computer Science and Technology, Hangzhou Dianzi University
Guangsheng Bao
Ph.D. Candidate, Westlake University & Zhejiang University.
Reasoning, Large Language Model, Natural Language Generation
Hongjie Qiu
College of Computer Science and Technology, Hangzhou Dianzi University
Jinqiang Li
College of Computer Science and Technology, Hangzhou Dianzi University
Xiangrong Zhu
PhD Student
Human computer interaction (HCI)
Huajian Zhang
Stony Brook University
Natural language generation
Yue Zhang
School of Engineering, Westlake University