V-STaR: Benchmarking Video-LLMs on Video Spatio-Temporal Reasoning

📅 2025-03-14
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current video-language models (Video-LLMs) exhibit severe deficiencies in spatio-temporal relational reasoning. Existing benchmarks predominantly assess object presence, failing to distinguish genuine causal inference from spurious co-occurrence bias. Method: We introduce V-STaR, the first benchmark dedicated to evaluating spatio-temporal reasoning, featuring the novel Reverse Spatio-Temporal Reasoning (RSTR) task, which systematically probes models' causal chain-of-thought understanding of "when," "where," and "what." Our methodology includes a decomposition-based evaluation paradigm; a GPT-4-driven, human-verified, rule-augmented Chain-of-Thought (CoT) question-generation pipeline; and a fine-grained, multi-stage reasoning-path scoring mechanism. Results: Experiments across 14 state-of-the-art models reveal average spatio-temporal reasoning accuracy 42.6% below human performance, providing the first empirical evidence that current Video-LLMs rely heavily on co-occurrence memorization rather than logical, causal reasoning.

📝 Abstract
Humans process video reasoning with a sequential spatio-temporal logic: we first identify the relevant frames ("when"), then analyse the spatial relationships ("where") between key objects, and finally leverage these relationships to draw inferences ("what"). However, can Video Large Language Models (Video-LLMs) also "reason through a sequential spatio-temporal logic" in videos? Existing Video-LLM benchmarks primarily focus on assessing object presence, neglecting relational reasoning. Consequently, it is difficult to measure whether a model truly comprehends object interactions (actions/events) in videos or merely relies on pre-trained "memory" of co-occurrences as biases when generating answers. In this work, we introduce the Video Spatio-Temporal Reasoning (V-STaR) benchmark to address these shortcomings. The key idea is to decompose video understanding into a Reverse Spatio-Temporal Reasoning (RSTR) task that simultaneously evaluates what objects are present, when events occur, and where they are located, while capturing the underlying Chain-of-Thought (CoT) logic. To support this evaluation, we construct a dataset that elicits the spatio-temporal reasoning process of Video-LLMs. It contains coarse-to-fine CoT questions generated by a semi-automated GPT-4-powered pipeline, embedding explicit reasoning chains to mimic human cognition. Experiments on 14 Video-LLMs with V-STaR reveal significant gaps between current Video-LLMs and the requirements of robust, consistent spatio-temporal reasoning.
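The "when → where → what" decomposition above implies scoring a model's full reasoning chain rather than only its final answer. A minimal sketch of such a chain scorer is below; the function names, the exact-match "what" check, and the equal-weight averaging are illustrative assumptions, not the paper's actual metric.

```python
def temporal_iou(pred, gt):
    """IoU between two (start, end) time intervals -- the 'when' step."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0

def spatial_iou(pred, gt):
    """IoU between two (x1, y1, x2, y2) boxes -- the 'where' step."""
    ix = max(0.0, min(pred[2], gt[2]) - max(pred[0], gt[0]))
    iy = max(0.0, min(pred[3], gt[3]) - max(pred[1], gt[1]))
    inter = ix * iy
    area = lambda b: (b[2] - b[0]) * (b[3] - b[1])
    union = area(pred) + area(gt) - inter
    return inter / union if union > 0 else 0.0

def score_chain(pred, gt):
    """Average the three step scores: exact-match accuracy for 'what',
    temporal IoU for 'when', spatial IoU for 'where' (assumed weights)."""
    s_what = 1.0 if pred["what"].strip().lower() == gt["what"].strip().lower() else 0.0
    s_when = temporal_iou(pred["when"], gt["when"])
    s_where = spatial_iou(pred["where"], gt["where"])
    return (s_what + s_when + s_where) / 3.0
```

For example, a prediction that names the right object but localizes it with partial temporal and spatial overlap would receive a score strictly between 0 and 1, exposing chain-level errors that answer-only accuracy hides.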
Problem

Research questions and friction points this paper is trying to address.

How well do current Video-LLMs perform sequential spatio-temporal reasoning over videos?
Do models comprehend object interactions (actions/events), or do they only detect object presence?
Can a benchmark measure whether a model follows the sequential "when → where → what" logic of video events?
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces V-STaR, a spatio-temporal reasoning benchmark for Video-LLMs
Decomposes video understanding into a Reverse Spatio-Temporal Reasoning (RSTR) task
Generates coarse-to-fine CoT questions via a semi-automated GPT-4-powered pipeline