VGenST-Bench: A Benchmark for Spatio-Temporal Reasoning via Active Video Synthesis

📅 2026-05-21
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
Existing spatiotemporal reasoning benchmarks predominantly rely on static images or passively collected videos, which are insufficient for evaluating the fine-grained spatiotemporal understanding capabilities of multimodal large language models. To address this limitation, this work proposes a novel paradigm of active video synthesis, leveraging a multi-agent generation pipeline coupled with human-in-the-loop quality control to produce highly controllable and diverse synthetic videos along with corresponding question-answer pairs. The authors introduce a three-dimensional taxonomy encompassing spatial scale, viewpoint, and dynamic change, and design a hierarchical task suite that enables decoupled assessment of low-level perception and high-level reasoning. The resulting benchmark, VGenST-Bench, facilitates fine-grained diagnostic evaluation and comprehensive assessment of models' spatiotemporal reasoning abilities.
📝 Abstract
Spatio-temporal reasoning is a core capability for Multimodal Large Language Models (MLLMs) operating in the real world. As such, evaluating it precisely has become an essential challenge. However, existing spatio-temporal reasoning benchmark datasets primarily rely on static image sets or passively curated video data, which limits the evaluation of fine-grained reasoning capabilities. In this paper, we introduce VGenST-Bench, a video benchmark that employs generative models to actively synthesize highly controlled and diverse evaluation scenarios. To construct VGenST-Bench, we propose a multi-agent pipeline incorporating a human quality control stage, ensuring the quality of all generated videos and QA pairs. We establish a comprehensive 3x2x2 video taxonomy, encompassing Spatial Scale, Perspective, and Scene Dynamics to span diverse scenarios. Furthermore, we design a hierarchical task suite that decouples low-level visual perception from high-level spatio-temporal reasoning. By shifting the paradigm from passive curation to active synthesis, VGenST-Bench enables fine-grained diagnosis of spatio-temporal understanding in MLLMs.
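To make the size of the 3x2x2 taxonomy concrete, the following minimal sketch enumerates its cells. The axis names come from the abstract; the level labels are placeholders, because the abstract does not name the individual levels, and assigning three levels to Spatial Scale is an assumption based only on the order in which the axes are listed. This is an illustration of the arithmetic, not the authors' implementation.

```python
from itertools import product

# Axis names are taken from the abstract. The level labels are hypothetical
# placeholders, and which axis has three levels is an assumption.
TAXONOMY = {
    "spatial_scale": ["scale_1", "scale_2", "scale_3"],
    "perspective": ["perspective_1", "perspective_2"],
    "scene_dynamics": ["dynamics_1", "dynamics_2"],
}


def taxonomy_cells(taxonomy):
    """Return every combination of one level per axis, each as a dict."""
    axes = list(taxonomy)
    return [dict(zip(axes, combo)) for combo in product(*taxonomy.values())]


cells = taxonomy_cells(TAXONOMY)
print(len(cells))  # 3 * 2 * 2 = 12
```

Each of the 12 cells corresponds to one scenario category that a benchmark of this shape would need to cover.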
Problem

Research questions and friction points this paper is trying to address.

spatio-temporal reasoning
Multimodal Large Language Models
benchmark
video synthesis
fine-grained evaluation
Innovation

Methods, ideas, or system contributions that make the work stand out.

active video synthesis
spatio-temporal reasoning
multimodal large language models
benchmark generation
hierarchical task suite