CaST-Bench: Benchmarking Causal Chain-Grounded Spatio-Temporal Reasoning for Video Question Answering

📅 2026-05-22

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing video question-answering benchmarks struggle to evaluate fine-grained, evidence-localizable spatiotemporal causal reasoning in vision-language models. To address this gap, this work introduces a benchmark centered on causal chain grounding, built through human-AI collaboration that collects complex causal questions and annotates multi-step spatiotemporal evidence chains, each comprising temporal segments and bounding-box trajectories. It proposes an evaluation paradigm that requires models to localize multiple spatiotemporal evidence segments when answering causal questions, and introduces metrics that jointly assess answer correctness and visual grounding fidelity. The dataset comprises 1,015 videos and 2,066 high-quality causal questions. Experimental results show that current models perform poorly on this task, highlighting their limitations in constructing precise and interpretable causal chains.

📝 Abstract

Cause-and-effect reasoning in video is a significant challenge for Vision-Language Models (VLMs), as it requires going beyond surface-level perception to a deeper understanding of causal mechanisms. However, existing benchmarks rarely provide the fine-grained, grounded evidence needed to rigorously evaluate this capability. To address this gap, we introduce CaST-Bench, a benchmark for Causal Chain-Grounded Spatio-Temporal Video Reasoning. CaST-Bench presents complex causal questions that require models to identify and localize a chain of multiple pieces of spatio-temporal evidence. Through a human-AI collaborative pipeline, we construct a high-quality dataset of 2,066 questions over 1,015 videos, with causal chains annotated by temporal segments and bounding-box tracks. Furthermore, we design a comprehensive evaluation suite with novel metrics that assess not only answer correctness but also the capability for reasoning grounded in visual evidence. This grounding is crucial for improving accuracy by mitigating spurious correlations and for enhancing user trust by making models more transparent. Our experiments show that current VLMs struggle with causal questions, largely due to their limited ability to construct precise and grounded causal chains. This highlights an important direction for improving future VLMs.
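The abstract does not define the benchmark's metrics, so the following is only an illustrative sketch of how a causal chain annotated with temporal segments and bounding-box tracks might be scored jointly with answer correctness. Every name here (`EvidenceStep`, `temporal_iou`, `track_iou`, `chain_score`), the index-based step alignment, and the multiplicative combination are assumptions made for illustration, not the paper's definitions.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

# Hypothetical box format (an assumption): (x1, y1, x2, y2) in pixels.
Box = Tuple[float, float, float, float]


@dataclass
class EvidenceStep:
    """One link of a causal chain: a time span plus a per-frame box track."""
    start: float            # segment start, in seconds
    end: float              # segment end, in seconds
    boxes: Dict[int, Box]   # frame index -> bounding box


def temporal_iou(a: EvidenceStep, b: EvidenceStep) -> float:
    """Intersection-over-union of two time segments."""
    inter = max(0.0, min(a.end, b.end) - max(a.start, b.start))
    union = (a.end - a.start) + (b.end - b.start) - inter
    return inter / union if union > 0 else 0.0


def box_iou(p: Box, q: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    iw = max(0.0, min(p[2], q[2]) - max(p[0], q[0]))
    ih = max(0.0, min(p[3], q[3]) - max(p[1], q[1]))
    inter = iw * ih
    union = (p[2] - p[0]) * (p[3] - p[1]) + (q[2] - q[0]) * (q[3] - q[1]) - inter
    return inter / union if union > 0 else 0.0


def track_iou(pred: EvidenceStep, gt: EvidenceStep) -> float:
    """Mean box IoU over all frames in either track; unmatched frames count as 0."""
    frames = set(pred.boxes) | set(gt.boxes)
    if not frames:
        return 0.0
    shared = set(pred.boxes) & set(gt.boxes)
    total = sum(box_iou(pred.boxes[f], gt.boxes[f]) for f in shared)
    return total / len(frames)


def chain_score(pred: List[EvidenceStep], gt: List[EvidenceStep],
                answer_correct: bool) -> float:
    """Assumed joint score: zero for a wrong answer, otherwise the mean of
    (temporal IoU x track IoU) over ground-truth steps, aligned by index."""
    if not answer_correct or not gt:
        return 0.0
    per_step = [
        temporal_iou(pred[i], g) * track_iou(pred[i], g) if i < len(pred) else 0.0
        for i, g in enumerate(gt)
    ]
    return sum(per_step) / len(gt)
```

Gating on answer correctness is one possible design choice: it prevents a model from earning grounding credit for a chain that leads to a wrong answer. A real evaluation suite could equally report the two quantities separately.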

Problem

Research questions and friction points this paper is trying to address.

causal reasoning

video question answering

spatio-temporal reasoning

vision-language models

causal chain

Innovation

Methods, ideas, or system contributions that make the work stand out.

causal reasoning

spatio-temporal grounding

video question answering