Evaluating Video Models as Simulators of Multi-Person Pedestrian Trajectories

📅 2025-10-23
🤖 AI Summary
This work investigates whether large-scale text-to-video (T2V) and image-to-video (I2V) models can serve as simulators of multi-pedestrian trajectories. To address the lack of standardized evaluation for multi-person scenes, it proposes an evaluation protocol with three parts: I2V generation conditioned on start frames from established datasets, so that generated motion can be compared with ground-truth video; a T2V prompt suite covering diverse pedestrian densities and interactions; and a method that reconstructs 2D bird's-eye-view trajectories from generated videos without camera calibration. Experiments indicate that leading models have learned plausible multi-agent interaction priors, producing dynamics such as collision avoidance and following. Failure modes persist, however, including people merging, abrupt appearance or disappearance, and spatial inconsistency. The resulting trajectory-based evaluation connects generative video modeling with behavioral simulation and points toward improving physical plausibility and social coherence in video generation.

📝 Abstract
Large-scale video generation models have demonstrated high visual realism in diverse contexts, spurring interest in their potential as general-purpose world simulators. However, existing benchmarks focus on individual subjects rather than scenes with multiple interacting people, so the plausibility of multi-agent dynamics in generated videos remains unverified. We propose a rigorous evaluation protocol to benchmark text-to-video (T2V) and image-to-video (I2V) models as implicit simulators of pedestrian dynamics. For I2V, we leverage start frames from established datasets to enable comparison with a ground-truth video dataset. For T2V, we develop a prompt suite to explore diverse pedestrian densities and interactions. A key component is a method to reconstruct 2D bird's-eye-view trajectories from pixel space without known camera parameters. Our analysis reveals that leading models have learned surprisingly effective priors for plausible multi-agent behavior. However, failure modes such as merging and disappearing people highlight areas for future improvement.
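The failure modes named in the abstract (merging and disappearing people) can be checked directly on reconstructed trajectories. The following is a minimal sketch, not the paper's metric: the function name `track_failures`, the track layout (`{person_id: {frame_index: (x, y)}}` in bird's-eye-view metres), and the `merge_dist` threshold are all assumptions made for illustration.

```python
from itertools import combinations
from math import hypot


def track_failures(tracks, num_frames, merge_dist=0.3):
    """Flag simple structural failures in bird's-eye-view tracks.

    tracks: {person_id: {frame_index: (x, y)}}, positions in metres
            (hypothetical layout; every track must have at least one frame).
    num_frames: total number of frames in the generated video.
    merge_dist: distance in metres below which two people are treated as
                merged (an assumed threshold, not taken from the paper).
    """
    # A person who is first seen after frame 0 "appeared" mid-video.
    appeared = sorted(pid for pid, t in tracks.items() if min(t) > 0)
    # A person whose track ends before the last frame "disappeared".
    disappeared = sorted(
        pid for pid, t in tracks.items() if max(t) < num_frames - 1
    )
    # Two people who come closer than merge_dist in a shared frame "merged".
    merged = set()
    for a, b in combinations(sorted(tracks), 2):
        for frame in tracks[a].keys() & tracks[b].keys():
            (xa, ya), (xb, yb) = tracks[a][frame], tracks[b][frame]
            if hypot(xa - xb, ya - yb) < merge_dist:
                merged.add((a, b))
                break
    return {
        "appeared": appeared,
        "disappeared": disappeared,
        "merged": sorted(merged),
    }


example_tracks = {
    "p1": {0: (0.0, 0.0), 1: (0.5, 0.0), 2: (1.0, 0.0)},
    "p2": {0: (2.0, 0.0), 1: (1.5, 0.0), 2: (1.1, 0.0)},
    "p3": {0: (0.0, 3.0), 1: (0.0, 3.5)},  # track ends one frame early
}
report = track_failures(example_tracks, num_frames=3)
print(report)
```

In this toy input, `p3` is flagged as disappeared and the pair `p1`/`p2` as merged. A tracker that loses a person behind an occluder would also trigger the disappearance flag, so in practice such counts would need to be interpreted alongside the tracking quality.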
Problem

Research questions and friction points this paper is trying to address.

Evaluating video models as pedestrian trajectory simulators
Assessing plausibility of multi-agent interactions in generated videos
Developing evaluation protocols for text-to-video pedestrian dynamics
Innovation

Methods, ideas, or system contributions that make the work stand out.

Evaluating video models as pedestrian trajectory simulators
Reconstructing 2D bird's-eye view trajectories without camera parameters
Developing prompt suite for diverse pedestrian interactions
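The summary does not describe how bird's-eye-view positions are recovered without camera parameters. As an illustration of why this is possible at all, here is a common heuristic that is not necessarily the authors' method: under a pinhole camera model, an upright pedestrian of roughly known height appears smaller the farther away they are, so bounding-box height can stand in for depth. The function name `pixel_to_bev`, the reference height, and the focal length value are all assumptions.

```python
def pixel_to_bev(box, image_width, ref_height_m=1.7, focal_px=1000.0):
    """Map a pedestrian bounding box to an approximate ground-plane position.

    box: (x_min, y_min, x_max, y_max) in pixels.
    image_width: frame width in pixels; the principal point is assumed to
                 lie at the horizontal image centre.
    ref_height_m: assumed real-world pedestrian height in metres.
    focal_px: assumed focal length in pixels. It is unknown in practice, so
              the returned depth is only correct up to a scale factor.
    Returns (lateral, depth) in metres under these assumptions.
    """
    x_min, y_min, x_max, y_max = box
    pixel_height = y_max - y_min
    if pixel_height <= 0:
        raise ValueError("bounding box must have positive height")
    # Pinhole model: pixel_height = focal_px * ref_height_m / depth.
    depth = focal_px * ref_height_m / pixel_height
    # Horizontal offset from the image centre, back-projected to that depth.
    u_center = 0.5 * (x_min + x_max)
    lateral = (u_center - image_width / 2) * depth / focal_px
    return lateral, depth


# A 340-pixel-tall box slightly right of centre in a 1920-pixel-wide frame.
lateral, depth = pixel_to_bev((900, 200, 1100, 540), image_width=1920)
print(lateral, depth)
```

In this sketch the lateral coordinate does not depend on the assumed focal length, while depth scales linearly with it, so the recovered trajectories have an unknown scale along the viewing direction. The heuristic also ignores camera tilt, partial occlusion, and variation in body height.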