🤖 AI Summary
This work investigates the counterfactual reasoning capability of vision-language models (VLMs) for video understanding: specifically, their ability to infer causal outcomes under hypothetical interventions. To this end, we introduce CounterVQA, the first benchmark dedicated to evaluating video-based counterfactual reasoning, featuring three progressively challenging categories of counterfactual questions. Systematic evaluation reveals substantial limitations in existing VLMs' capacity to model multi-hop causal chains. We propose CFGPT, a novel post-training method that combines cross-modal feature alignment with counterfactual knowledge distillation to transfer reasoning capabilities from strong teacher models to lightweight student models. Experiments demonstrate that CFGPT achieves significant performance gains across all difficulty levels of CounterVQA, substantially narrowing the gap between open-source and proprietary VLMs on complex counterfactual video question answering. This work thus establishes both a new benchmark and an effective technical framework for video-level causal reasoning.
📝 Abstract
Vision-Language Models (VLMs) have recently shown significant advancements in video understanding, especially in feature alignment, event reasoning, and instruction-following tasks. However, their capability for counterfactual reasoning (inferring alternative outcomes under hypothetical conditions) remains underexplored. This capability is essential for robust video understanding, as it requires identifying underlying causal structures and reasoning about unobserved possibilities rather than merely recognizing observed patterns. To systematically evaluate this capability, we introduce CounterVQA, a video-based benchmark featuring three progressive difficulty levels that assess different aspects of counterfactual reasoning. Through a comprehensive evaluation of both state-of-the-art open-source and closed-source models, we uncover a substantial performance gap: while these models achieve reasonable accuracy on simple counterfactual questions, their performance degrades significantly on complex multi-hop causal chains. To address these limitations, we develop a post-training method, CFGPT, that enhances a model's visual counterfactual reasoning ability by distilling counterfactual reasoning capability from the language modality, yielding consistent improvements across all CounterVQA difficulty levels. The dataset and code will be released.