🤖 AI Summary
Modeling realistic pedestrian dynamics in crowded public scenes remains challenging for generative models, particularly when leveraging image-to-video (I2V) diffusion transformers (DiTs) without explicit behavioral priors.
Method: We propose a keyframe-conditioned generation paradigm driven by pedestrian trajectory benchmarks, establishing a unified evaluation framework that jointly assesses visual fidelity and trajectory-level dynamics. We introduce quantitative trajectory metrics, including displacement distributions, velocity statistics, and interaction density, to measure dynamic plausibility and temporal consistency.
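The summary does not spell out how these metrics are implemented, but a minimal sketch is straightforward, assuming tracks arrive as a (P, T, 2) array of world coordinates in meters. The function name, frame rate, and neighbor radius below are illustrative choices, not the authors' code.

```python
import numpy as np

def trajectory_metrics(tracks, fps=25.0, radius=2.0):
    """Dynamics statistics for pedestrian tracks.

    tracks: (P, T, 2) array -- P pedestrians over T frames, world
    coordinates in meters (assumed input format).
    """
    steps = np.diff(tracks, axis=1)                  # (P, T-1, 2) per-frame motion
    displacements = np.linalg.norm(steps, axis=-1)   # meters per frame
    speeds = displacements * fps                     # meters per second

    # Interaction density: average number of other pedestrians within
    # `radius` meters, over all pedestrians and frames.
    pairwise = np.linalg.norm(
        tracks[:, None] - tracks[None, :], axis=-1)  # (P, P, T) distances
    neighbors = (pairwise < radius).sum(axis=1) - 1  # subtract self-match

    return {
        "displacement_hist": np.histogram(displacements, bins=50),
        "mean_speed": float(speeds.mean()),
        "speed_std": float(speeds.std()),
        "interaction_density": float(neighbors.mean()),
    }
```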
Contribution/Results: Our experiments demonstrate that DiT-based I2V models, despite lacking explicit pedestrian behavior modeling, spontaneously generate videos whose trajectory statistics closely approximate those of real-world pedestrian data across multiple metrics. This work pioneers the use of DiT-based I2V models as implicit simulators of pedestrian dynamics and provides a reproducible, trajectory-aware benchmark for evaluating generative models in social behavior modeling.
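One natural way to score how "closely" generated statistics approximate real ones is a distributional distance between pooled per-step values; the sketch below uses SciPy's 1-D Wasserstein distance, which is an assumption on my part rather than the paper's stated protocol.

```python
from scipy.stats import wasserstein_distance

def dynamics_gap(real_stats, gen_stats):
    """Per-metric 1-D Wasserstein distances between real and generated
    statistics. Each dict maps a metric name (e.g. "speed") to a flat
    array of values pooled over all pedestrians and frames."""
    return {name: wasserstein_distance(real_stats[name], gen_stats[name])
            for name in real_stats}
```

Lower distances indicate generated dynamics whose marginal statistics better match the benchmark data.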
📝 Abstract
Recent high-performing image-to-video (I2V) models based on variants of the diffusion transformer (DiT) have displayed remarkable inherent world-modeling capabilities by virtue of training on large-scale video datasets. We investigate whether these models can generate realistic pedestrian movement patterns in crowded public scenes. Our framework conditions I2V models on keyframes extracted from pedestrian trajectory benchmarks, then evaluates their trajectory prediction performance using quantitative measures of pedestrian dynamics.
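To make this evaluation loop concrete, here is a runnable toy version in which random walks stand in for tracker output; in the actual framework, the real tracks would come from the benchmark's annotations and the generated tracks from a pedestrian tracker run on the I2V model's output.

```python
import numpy as np
from scipy.stats import wasserstein_distance

rng = np.random.default_rng(0)

def random_walk_tracks(num_peds, num_frames, step_std):
    # Stand-in for tracker output: (P, T, 2) positions in meters.
    steps = rng.normal(scale=step_std, size=(num_peds, num_frames - 1, 2))
    start = rng.uniform(0.0, 10.0, size=(num_peds, 1, 2))
    return np.cumsum(np.concatenate([start, steps], axis=1), axis=1)

def step_speeds(tracks, fps=25.0):
    # Flat array of per-step speeds in meters per second.
    return (np.linalg.norm(np.diff(tracks, axis=1), axis=-1) * fps).ravel()

# Stand-ins: benchmark annotations vs. tracks recovered from generated video.
real_tracks = random_walk_tracks(20, 100, step_std=0.05)
gen_tracks = random_walk_tracks(20, 100, step_std=0.06)
print("speed W1 (m/s):", wasserstein_distance(step_speeds(real_tracks),
                                              step_speeds(gen_tracks)))
```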