VISTA: Video Interaction Spatio-Temporal Analysis Benchmark

📅 2026-05-02
📈 Citations: 0
Influential: 0

🤖 AI Summary
Existing vision-language model benchmarks struggle to evaluate understanding of spatio-temporal interactions among multiple entities and actions in open-world scenarios, and they lack a systematic diagnostic framework. This work proposes VISTA, the first large-scale, interaction-aware diagnostic benchmark, which parses videos into structured annotations of entities, actions, and their dynamic relationships under a unified interaction taxonomy. Built by integrating multiple source datasets, the benchmark comprises approximately 12K video-query pairs. Using it, the authors run fine-grained evaluations of 11 state-of-the-art models along spatial and temporal axes, uncovering pronounced temporal and spatial biases and performance gaps that conventional aggregate metrics obscure. The findings suggest new directions for both model development and evaluation methodology in vision-language understanding.
📝 Abstract
Existing benchmarks for Vision-Language Models (VLMs) primarily evaluate spatio-temporal understanding on simple single-action videos, closed attribute sets, and restricted entity types, failing to capture the freeform, multi-action interactions between diverse entities that characterize real-world video understanding. Furthermore, the lack of a systematic framework for analyzing model failures across complementary spatio-temporal axes hinders comprehensive evaluation. To address these gaps, we introduce VISTA, a Video Interaction Spatio-Temporal Analysis benchmark designed for open-set, multi-entity, and multi-action spatio-temporal understanding in VLMs. VISTA decomposes videos into interpretable entities, their associated actions, and relational dynamics, enabling multi-axis diagnostics and unified assessment of relational, spatial, and temporal understanding. Our benchmark integrates multiple datasets into a single interaction-aware taxonomy and comprises ~12K curated video-query pairs spanning diverse scenes and complexities. We systematically evaluate 11 state-of-the-art VLMs on VISTA and break down aggregate performance across our taxonomy to reveal shortcomings and pronounced spatio-temporal biases obscured by traditional metrics. By providing detailed, taxonomy-driven diagnostics on a challenging dataset, VISTA offers a nuanced framework to guide advances in model design, pretraining strategies, and evaluation protocols. Overall, VISTA is the first large-scale, interaction-aware diagnostic benchmark for spatio-temporal understanding in VLMs.
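To make the idea of breaking aggregate performance down along taxonomy axes concrete, here is a minimal sketch. It is not the VISTA evaluation code: the record fields (`temporal`, `spatial`, `correct`), the category names, and the toy results are all hypothetical, chosen only to show how a single aggregate score can hide per-axis gaps.

```python
from collections import defaultdict

# Hypothetical per-query results. Field names and category values are
# illustrative assumptions, not the actual VISTA annotation schema.
RESULTS = [
    {"temporal": "early", "spatial": "foreground", "correct": True},
    {"temporal": "early", "spatial": "foreground", "correct": True},
    {"temporal": "early", "spatial": "background", "correct": True},
    {"temporal": "early", "spatial": "background", "correct": False},
    {"temporal": "late", "spatial": "foreground", "correct": True},
    {"temporal": "late", "spatial": "foreground", "correct": False},
    {"temporal": "late", "spatial": "background", "correct": False},
    {"temporal": "late", "spatial": "background", "correct": False},
]


def overall_accuracy(results):
    """Fraction of queries answered correctly, ignoring the taxonomy."""
    return sum(r["correct"] for r in results) / len(results)


def accuracy_by_axis(results, axis):
    """Return {category: accuracy} for one taxonomy axis (e.g. 'temporal')."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for r in results:
        category = r[axis]
        totals[category] += 1
        hits[category] += int(r["correct"])
    return {category: hits[category] / totals[category] for category in totals}


if __name__ == "__main__":
    print("overall:", overall_accuracy(RESULTS))
    for axis in ("temporal", "spatial"):
        print(axis, accuracy_by_axis(RESULTS, axis))
```

In this toy data the overall accuracy is 0.5, while the per-axis view shows 0.75 versus 0.25 on both the temporal and spatial splits, which is the kind of bias an aggregate metric alone would not reveal.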
Problem

Research questions and friction points this paper is trying to address.

spatio-temporal understanding
vision-language models
video interaction
multi-entity
multi-action
Innovation

Methods, ideas, or system contributions that make the work stand out.

spatio-temporal understanding
vision-language models
interaction-aware benchmark
multi-entity multi-action analysis
diagnostic evaluation