🤖 AI Summary
This work investigates whether large-scale text-to-video (T2V) and image-to-video (I2V) models can serve as faithful simulators of multi-pedestrian trajectories. To address the lack of standardized evaluation for multi-person scenes, we propose a systematic benchmark. For I2V, start frames taken from established datasets allow generated motion to be compared against ground-truth trajectories that share the same initial frame. For T2V, a crowd prompt set covers a range of pedestrian densities and interactions. In both settings, 2D bird's-eye-view trajectories are reconstructed from the generated videos without camera calibration. Experiments reveal that state-of-the-art models implicitly capture plausible multi-agent interaction priors, producing intuitive dynamics such as collision avoidance and following. However, structural failures persist, including person merging, abrupt appearance and disappearance, and spatial inconsistency. Our study empirically evaluates video diffusion models as crowd behavior simulators and establishes a reproducible, trajectory-based evaluation framework. The benchmark connects generative modeling with behavioral simulation and points toward ways of improving physical plausibility and social coherence in video generation.
📝 Abstract
Large-scale video generation models have demonstrated high visual realism in diverse contexts, spurring interest in their potential as general-purpose world simulators. However, existing benchmarks focus on individual subjects rather than scenes with multiple interacting people, so the plausibility of multi-agent dynamics in generated videos remains unverified. We propose a rigorous evaluation protocol to benchmark text-to-video (T2V) and image-to-video (I2V) models as implicit simulators of pedestrian dynamics. For I2V, we use start frames from established datasets so that generated videos can be compared against ground-truth videos of the same scenes. For T2V, we develop a prompt suite that covers diverse pedestrian densities and interactions. A key component is a method to reconstruct 2D bird's-eye-view trajectories from pixel space without known camera parameters. Our analysis reveals that leading models have learned surprisingly effective priors for plausible multi-agent behavior. However, failure modes such as people merging or disappearing highlight areas for future improvement.