🤖 AI Summary
Addressing the challenge of jointly modeling video understanding and decision-making for autonomous driving, this paper introduces VaViM/VaVAM, an open-source video-to-action joint modeling framework. Methodologically, it combines autoregressive spatio-temporal token modeling with video-generation pre-training, enabling end-to-end trajectory generation from driving videos via representation transfer; the resulting perception-to-action pipeline is systematically evaluated in both open-loop and closed-loop driving scenarios. Key contributions include: (1) an empirical analysis of how model size and data scale relate to the semantic quality of learned representations and to safety metrics in closed-loop driving; (2) evidence that large-scale video generative models transfer effectively to embodied driving tasks; and (3) full open-sourcing of code and model weights to advance video-driven autonomous driving research. Experiments indicate that video pre-training improves the robustness and safety of trajectory prediction in complex traffic conditions.
📝 Abstract
We explore the potential of large-scale generative video models for autonomous driving, introducing an open-source auto-regressive video model (VaViM) and its companion video-action model (VaVAM) to investigate how video pre-training transfers to real-world driving. VaViM is a simple auto-regressive video model that predicts frames using spatio-temporal token sequences. We show that it captures the semantics and dynamics of driving scenes. VaVAM, the video-action model, leverages the learned representations of VaViM to generate driving trajectories through imitation learning. Together, the models form a complete perception-to-action pipeline. We evaluate our models in open- and closed-loop driving scenarios, revealing that video-based pre-training holds promise for autonomous driving. Key insights include the semantic richness of the learned representations, the benefits of scaling for video synthesis, and the complex relationship between model size, data, and safety metrics in closed-loop evaluations. We release code and model weights at https://github.com/valeoai/VideoActionModel.
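To make the "spatio-temporal token sequences" idea concrete, here is a minimal sketch of how an auto-regressive video model like VaViM can be framed: each frame is first mapped to a grid of discrete token ids (e.g., by a learned image tokenizer), the grids are flattened into one long sequence in temporal-then-raster order, and the model is trained to predict each token from its prefix. This is an illustrative toy under stated assumptions, not the authors' implementation; the function names (`flatten_video_tokens`, `next_token_targets`) are hypothetical.

```python
def flatten_video_tokens(video_tokens):
    """video_tokens: list of T frames, each an H x W grid of discrete token ids
    (assumed to come from an image tokenizer). Returns one 1-D spatio-temporal
    sequence: frames in temporal order, tokens in raster order within a frame."""
    seq = []
    for frame in video_tokens:      # temporal order
        for row in frame:           # spatial raster order within a frame
            seq.extend(row)
    return seq

def next_token_targets(seq):
    """Teacher-forcing pairs for the auto-regressive objective:
    the model conditions on seq[:i] and is trained to predict seq[i]."""
    return [(seq[:i], seq[i]) for i in range(1, len(seq))]

# Toy example: 2 frames, each a 2x2 grid of token ids.
video = [[[1, 2], [3, 4]], [[5, 6], [7, 8]]]
seq = flatten_video_tokens(video)       # [1, 2, 3, 4, 5, 6, 7, 8]
pairs = next_token_targets(seq)         # ([1], 2), ([1, 2], 3), ...
```

A video-action model in the VaVAM style would then read off the representations produced while processing such a sequence and feed them to an action head trained by imitation learning on recorded trajectories.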