GTASA: Ground Truth Annotations for Spatiotemporal Analysis, Evaluation and Training of Video Models

📅 2026-04-11

🤖 AI Summary
Neural video generators still struggle with complex multi-actor scenes, and their output is hard to evaluate for physical plausibility and semantic faithfulness because precise spatiotemporal ground truth is missing. This work introduces GTASA, a corpus of multi-actor videos annotated with per-frame spatial relation graphs and event-level temporal mappings, together with GEST-Engine, the system that produced it, built on the Graphs of Events in Space and Time (GEST) framework. Human evaluation of physical validity and semantic alignment, along with experiments that train video captioning models, favors GEST-Engine over both open- and closed-source neural generators. Probing four frozen video encoders on 11 spatiotemporal reasoning tasks, made possible by GTASA's exact 3D ground truth, further shows that self-supervised encoders capture spatial structure significantly better than VLM visual encoders.

📝 Abstract
Generating complex multi-actor scenario videos remains difficult even for state-of-the-art neural generators, while evaluating them is hard due to the lack of ground truth for physical plausibility and semantic faithfulness. We introduce GTASA, a corpus of multi-actor videos with per-frame spatial relation graphs and event-level temporal mappings, and GEST-Engine, the system that produced it, based on Graphs of Events in Space and Time (GEST). We compare our method with both open- and closed-source neural generators and show, both qualitatively (human evaluation of physical validity and semantic alignment) and quantitatively (via training video captioning models), the clear advantages of our method. Probing four frozen video encoders across 11 spatiotemporal reasoning tasks enabled by GTASA's exact 3D ground truth reveals that self-supervised encoders encode spatial structure significantly better than VLM visual encoders.
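
To make the two annotation types in the abstract concrete, the following is a minimal sketch of what a per-frame spatial relation graph and an event-level temporal mapping could look like. It is not the GTASA schema or GEST-Engine code: the `Relation` and `Event` classes, the `near` predicate, the 2.0 distance threshold, and the toy entity names and actions are all assumptions made for illustration.

```python
from dataclasses import dataclass
from math import dist


@dataclass(frozen=True)
class Relation:
    """One edge of a per-frame spatial relation graph (hypothetical schema)."""
    subject: str
    predicate: str
    obj: str


@dataclass(frozen=True)
class Event:
    """An event mapped to an inclusive frame range (hypothetical schema)."""
    actor: str
    action: str
    start_frame: int
    end_frame: int


def frame_relations(positions, near_threshold=2.0):
    """Derive 'near' relations for one frame from exact 3D positions.

    `positions` maps an entity name to an (x, y, z) tuple. The threshold
    is an illustrative assumption, not a value taken from the paper.
    """
    names = sorted(positions)
    relations = []
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if dist(positions[a], positions[b]) <= near_threshold:
                relations.append(Relation(a, "near", b))
    return relations


def events_at(events, frame):
    """Return the events whose frame range covers `frame`."""
    return [e for e in events if e.start_frame <= frame <= e.end_frame]


# Toy two-frame clip: actor_2 walks toward actor_1, who is next to a chair.
frames = [
    {"actor_1": (0.0, 0.0, 0.0), "actor_2": (5.0, 0.0, 0.0), "chair": (0.5, 0.0, 1.0)},
    {"actor_1": (0.0, 0.0, 0.0), "actor_2": (1.0, 0.0, 0.0), "chair": (0.5, 0.0, 1.0)},
]
events = [
    Event("actor_2", "walk_to", 0, 1),
    Event("actor_1", "sit_down", 1, 1),
]

graphs = [frame_relations(p) for p in frames]
```

In this toy clip, `graphs[0]` holds a single relation between `actor_1` and the chair, while `graphs[1]` connects all three entities once `actor_2` has arrived. `events_at(events, 1)` returns both events, which is the kind of frame-to-event lookup an event-level temporal mapping supports.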
Problem

Research questions and friction points this paper is trying to address.

video generation
multi-actor scenarios
ground truth annotations
physical plausibility
semantic faithfulness
Innovation

Methods, ideas, or system contributions that make the work stand out.

spatiotemporal reasoning
ground truth annotation
multi-actor video generation
spatial relation graph
self-supervised video encoder