🤖 AI Summary
Existing video understanding benchmarks lack a systematic evaluation of counterfactual reasoning, which hinders assessment of the dynamic logical reasoning capabilities of multimodal large language models (MLLMs) along the abstract–concrete and perceptual–cognitive dimensions.
Method: We introduce COVER, the first multidimensional benchmark for video counterfactual reasoning, featuring coverage-based counterfactual modeling, subproblem-driven reasoning attribution, multi-model performance decoupling analysis, and a unified video–language joint reasoning protocol. COVER employs structured subproblem decomposition to enable fine-grained characterization of reasoning capabilities.
Contribution/Results: COVER is the first benchmark to uncover a strong correlation between structured reasoning and video understanding robustness. Extensive experiments on leading commercial and open-source MLLMs demonstrate that COVER can localize interpretable performance gaps. COVER establishes a new evaluation standard for dynamic-scene logical reasoning, advancing rigorous, transparent assessment of MLLM reasoning fidelity in complex spatiotemporal contexts.
📝 Abstract
Counterfactual reasoning is crucial for robust video understanding but remains underexplored in existing multimodal benchmarks. In this paper, we introduce **COVER** (**CO**unterfactual **V**id**E**o **R**easoning), a multidimensional multimodal benchmark that systematically evaluates multimodal large language models (MLLMs) across the abstract-concrete and perception-cognition dimensions. Beyond prior multimodal benchmarks, COVER decomposes complex queries into structured sub-questions, enabling fine-grained reasoning analysis. Experiments on commercial and open-source models reveal a strong correlation between sub-question accuracy and counterfactual reasoning performance, highlighting the role of structured inference in video understanding. Furthermore, our results suggest a key insight: enhancing the reasoning capability of models is essential for improving the robustness of video understanding. COVER establishes a new standard for assessing MLLMs' logical reasoning abilities in dynamic environments.
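To make the sub-question-driven analysis concrete, the following is a minimal sketch of how one might correlate per-item sub-question accuracy with main-question correctness across a model's results. The item schema (`main_correct`, `sub_answers`) and the toy data are assumptions for illustration, not the benchmark's actual format or findings.

```python
# Hypothetical sketch: correlate sub-question accuracy with main-question
# correctness. Field names and toy data are illustrative assumptions only.

def subquestion_accuracy(item):
    """Fraction of an item's sub-questions answered correctly."""
    subs = item["sub_answers"]  # list of (predicted, gold) pairs
    return sum(1 for pred, gold in subs if pred == gold) / len(subs)

def pearson(xs, ys):
    """Plain Pearson correlation coefficient, no external dependencies."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# Toy results for one model: each item pairs a main counterfactual
# question with its decomposed sub-questions.
results = [
    {"main_correct": 1, "sub_answers": [("a", "a"), ("b", "b"), ("c", "c")]},
    {"main_correct": 1, "sub_answers": [("a", "a"), ("b", "b"), ("d", "c")]},
    {"main_correct": 0, "sub_answers": [("a", "b"), ("c", "b"), ("d", "c")]},
    {"main_correct": 0, "sub_answers": [("a", "a"), ("x", "b"), ("d", "c")]},
]

sub_acc = [subquestion_accuracy(r) for r in results]
main = [float(r["main_correct"]) for r in results]
print(round(pearson(sub_acc, main), 3))  # → 0.894
```

A positive coefficient here would mirror the abstract's observation that models answering the structured sub-questions well also tend to get the counterfactual main question right.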