VGenST-Bench: A Benchmark for Spatio-Temporal Reasoning via Active Video Synthesis

📅 2026-05-21

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing spatio-temporal reasoning benchmarks rely mainly on static images or passively collected videos, which are insufficient for evaluating the fine-grained spatio-temporal understanding of multimodal large language models. To address this limitation, this work proposes active video synthesis: a multi-agent generation pipeline, coupled with human-in-the-loop quality control, produces highly controllable and diverse synthetic videos along with corresponding question-answer pairs. The authors introduce a three-axis taxonomy covering spatial scale, perspective, and scene dynamics, and design a hierarchical task suite that assesses low-level perception separately from high-level reasoning. The resulting benchmark, VGenST-Bench, supports fine-grained diagnostic evaluation of models’ spatio-temporal reasoning abilities.

📝 Abstract

Spatio-temporal reasoning is a core capability for Multimodal Large Language Models (MLLMs) operating in the real world. As such, evaluating it precisely has become an essential challenge. However, existing spatio-temporal reasoning benchmark datasets primarily rely on static image sets or passively curated video data, which limits the evaluation of fine-grained reasoning capabilities. In this paper, we introduce VGenST-Bench, a video benchmark that employs generative models to actively synthesize highly controlled and diverse evaluation scenarios. To construct VGenST-Bench, we propose a multi-agent pipeline incorporating a human quality control stage, ensuring the quality of all generated videos and QA pairs. We establish a comprehensive 3×2×2 video taxonomy, encompassing Spatial Scale, Perspective, and Scene Dynamics, to span diverse scenarios. Furthermore, we design a hierarchical task suite that decouples low-level visual perception from high-level spatio-temporal reasoning. By shifting the paradigm from passive curation to active synthesis, VGenST-Bench enables fine-grained diagnosis of spatio-temporal understanding in MLLMs.
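
To make the taxonomy and the decoupled evaluation concrete, the following is a minimal sketch, not the authors' code. It enumerates the cells of a three-axis grid with 3, 2, and 2 levels and reports accuracy separately per cell and task tier. The level names, the assignment of three levels to Spatial Scale, the tier labels `perception` and `reasoning`, and the record fields are all illustrative assumptions; the abstract specifies only the axis names and the overall grid shape.

```python
from collections import defaultdict
from itertools import product

# Hypothetical level names: the abstract names the axes and the grid shape only.
# Giving Spatial Scale the three levels is an assumption based on axis order.
TAXONOMY = {
    "spatial_scale": ["scale_a", "scale_b", "scale_c"],
    "perspective": ["view_a", "view_b"],
    "scene_dynamics": ["dynamics_a", "dynamics_b"],
}


def taxonomy_cells(taxonomy):
    """Return every combination of one level per axis, in axis order."""
    return list(product(*taxonomy.values()))


def accuracy_by_cell_and_tier(records):
    """Compute accuracy for each (cell, tier) pair.

    Each record is a dict with assumed keys:
      "cell": a tuple from taxonomy_cells, "tier": a task-tier label,
      "correct": whether the model answered the question correctly.
    """
    totals = defaultdict(int)
    hits = defaultdict(int)
    for record in records:
        key = (record["cell"], record["tier"])
        totals[key] += 1
        hits[key] += int(record["correct"])
    return {key: hits[key] / totals[key] for key in totals}


cells = taxonomy_cells(TAXONOMY)
print(len(cells))  # 12 cells for a grid of 3 * 2 * 2 levels

# Toy results for a single cell, with perception and reasoning scored apart.
toy_records = [
    {"cell": cells[0], "tier": "perception", "correct": True},
    {"cell": cells[0], "tier": "perception", "correct": False},
    {"cell": cells[0], "tier": "reasoning", "correct": True},
]
print(accuracy_by_cell_and_tier(toy_records))
```

Keeping the tier in the grouping key is what allows a perception failure to be distinguished from a reasoning failure within the same scenario type.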

Problem

Research questions and friction points this paper is trying to address.

spatio-temporal reasoning

Multimodal Large Language Models

benchmark

video synthesis

fine-grained evaluation

Innovation

Methods, ideas, or system contributions that make the work stand out.

active video synthesis

spatio-temporal reasoning

multimodal large language models