🤖 AI Summary
Current causal video question answering (VideoQA) models predominantly adopt black-box, end-to-end architectures, which limits their capacity for higher-order causal reasoning and hurts interpretability and generalization. To address these limitations, we propose a modular two-stage framework: (1) in the first stage, a large language model (LLM) automatically generates structured natural-language causal chains, which serve as supervision for a dedicated Causal Chain Extractor (CCE); (2) in the second stage, a Causal Chain-Driven Answerer (CCDA) explicitly decouples causal reasoning from answer prediction, using the extracted causal chain as explicit guidance. We further introduce CauCo, a new evaluation metric designed for causality-oriented captioning. Our approach achieves state-of-the-art performance across three mainstream benchmarks, and extensive experiments demonstrate substantial improvements in reasoning transparency, user trust, and cross-domain generalization, validating causal chains as reusable, interpretable, and effective reasoning engines for VideoQA.
📝 Abstract
Existing Causal-Why Video Question Answering (VideoQA) models often struggle with higher-order reasoning, relying on opaque, monolithic pipelines that entangle video understanding, causal inference, and answer generation. These black-box approaches offer limited interpretability and tend to depend on shallow heuristics. We propose a novel, modular framework that explicitly decouples causal reasoning from answer generation, introducing natural language causal chains as interpretable intermediate representations. Inspired by human cognitive models, these structured cause-effect sequences bridge low-level video content with high-level causal reasoning, enabling transparent and logically coherent inference. Our two-stage architecture comprises a Causal Chain Extractor (CCE) that generates causal chains from video-question pairs, and a Causal Chain-Driven Answerer (CCDA) that produces answers grounded in these chains. To address the lack of annotated reasoning traces, we introduce a scalable method for generating high-quality causal chains from existing datasets using large language models. We also propose CauCo, a new evaluation metric for causality-oriented captioning. Experiments on three large-scale benchmarks demonstrate that our approach not only outperforms state-of-the-art models, but also yields substantial gains in explainability, user trust, and generalization -- positioning the CCE as a reusable causal reasoning engine across diverse domains. Project page: https://paritoshparmar.github.io/chainreaction/
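To make the two-stage decomposition concrete, here is a minimal, purely illustrative sketch of the CCE → CCDA flow. All names (`CausalLink`, `extract_causal_chain`, `answer_from_chain`) are hypothetical stand-ins invented for this example, not the authors' actual API; the real CCE and CCDA are learned models, whereas this toy version just links consecutive events into a chain and reads an answer off it.

```python
# Illustrative sketch only: stand-ins for the learned CCE/CCDA modules.
from dataclasses import dataclass

@dataclass
class CausalLink:
    cause: str
    effect: str

def extract_causal_chain(video_events, question):
    """Stage 1 (CCE stand-in): produce a cause-effect chain from the video.
    The real extractor is a trained model conditioned on video + question;
    this toy version simply links consecutive observed events."""
    return [CausalLink(a, b) for a, b in zip(video_events, video_events[1:])]

def answer_from_chain(chain, question):
    """Stage 2 (CCDA stand-in): generate an answer grounded in the chain,
    keeping reasoning (the chain) separate from answer prediction."""
    return f"Because {chain[0].cause}, eventually {chain[-1].effect}."

events = ["the cyclist hits a pothole",
          "the cyclist loses balance",
          "the cyclist falls"]
question = "Why did the cyclist fall?"
chain = extract_causal_chain(events, question)
print(answer_from_chain(chain, question))
# → Because the cyclist hits a pothole, eventually the cyclist falls.
```

The point of the decoupling is that the intermediate chain is a human-readable artifact: it can be inspected, evaluated (e.g. with CauCo), or reused by a different answerer, unlike the hidden states of an end-to-end model.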