VaViM and VaVAM: Autonomous Driving through Video Generative Modeling

📅 2025-02-21
📈 Citations: 0
Influential: 0
🤖 AI Summary
Addressing the challenge of jointly modeling video understanding and decision-making for autonomous driving, this paper introduces VaViM/VaVAM, an open-source video-to-action joint modeling framework. Methodologically, it employs autoregressive spatio-temporal token modeling for video-generation pretraining, then transfers the learned representations to end-to-end trajectory generation via imitation learning; the resulting perception-to-action pipeline is systematically evaluated in both open-loop and closed-loop driving settings. Key contributions include: (1) evidence that video pretraining yields semantically rich driving representations, together with an analysis of the complex relationship between model size, data, and safety metrics in closed-loop evaluation; (2) validation that large-scale video generative models transfer effectively to embodied driving tasks; and (3) full open-sourcing of code and model weights to advance video-driven autonomous driving research.
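The autoregressive spatio-temporal token modeling described above treats a driving video as one long sequence of discrete tokens and trains next-token prediction over it. The sketch below is an illustrative reconstruction of that data layout only, not the paper's code; the function names and raster-scan ordering are assumptions:

```python
# Hypothetical sketch of spatio-temporal token sequencing for
# autoregressive video modeling. Names and ordering are illustrative
# assumptions, not VaViM's actual implementation.

def flatten_video_tokens(frames):
    """frames: list of 2D grids of discrete token ids (one grid per frame,
    e.g. from a pretrained image tokenizer). Returns one flat sequence,
    frame by frame in raster order."""
    seq = []
    for grid in frames:
        for row in grid:
            seq.extend(row)
    return seq

def next_token_pairs(seq):
    """Autoregressive training pairs: the model conditions on seq[:i]
    and is trained to predict seq[i]."""
    return [(seq[:i], seq[i]) for i in range(1, len(seq))]
```

Because the sequence interleaves space (within a frame) and time (across frames), a single next-token objective captures both scene appearance and dynamics.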

📝 Abstract
We explore the potential of large-scale generative video models for autonomous driving, introducing an open-source auto-regressive video model (VaViM) and its companion video-action model (VaVAM) to investigate how video pre-training transfers to real-world driving. VaViM is a simple auto-regressive video model that predicts frames using spatio-temporal token sequences. We show that it captures the semantics and dynamics of driving scenes. VaVAM, the video-action model, leverages the learned representations of VaViM to generate driving trajectories through imitation learning. Together, the models form a complete perception-to-action pipeline. We evaluate our models in open- and closed-loop driving scenarios, revealing that video-based pre-training holds promise for autonomous driving. Key insights include the semantic richness of the learned representations, the benefits of scaling for video synthesis, and the complex relationship between model size, data, and safety metrics in closed-loop evaluations. We release code and model weights at https://github.com/valeoai/VideoActionModel
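The abstract's imitation-learning step fits a trajectory head, on top of VaViM's learned representations, to expert demonstrations. A minimal sketch of such an objective, assuming an L1 loss over 2D waypoints (the exact loss used by VaVAM is not stated on this page):

```python
# Hypothetical imitation objective: mean L1 distance between predicted
# and expert trajectories, each given as a list of (x, y) waypoints.
# This is an illustrative assumption, not the paper's confirmed loss.

def imitation_l1_loss(pred_traj, expert_traj):
    """Mean absolute error over matched waypoints of two trajectories."""
    assert len(pred_traj) == len(expert_traj), "trajectories must align"
    total = 0.0
    for (px, py), (ex, ey) in zip(pred_traj, expert_traj):
        total += abs(px - ex) + abs(py - ey)
    return total / len(pred_traj)
```

Minimizing such a loss over logged expert drives is what turns the video model's representations into a driving policy, which is then stress-tested in the open- and closed-loop evaluations mentioned above.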
Problem

Research questions and friction points this paper is trying to address.

Autonomous driving through video models
Transferring video pre-training to driving
Evaluating safety in driving scenarios
Innovation

Methods, ideas, or system contributions that make the work stand out.

Auto-regressive video model
Video-action imitation learning
Perception-to-action pipeline
👥 Authors

Florent Bartoccioni
valeo.ai, Paris, France

Elias Ramzi
Research scientist, valeo.ai
Deep learning, Computer Vision

Victor Besnier
valeo.ai
Deep learning, Computer Vision

Shashanka Venkataramanan
INRIA
Computer vision, Pattern recognition, Machine Learning

Tuan-Hung Vu
Senior Researcher, valeo.ai
Computer Vision, Deep Learning, Domain Adaptation, Zero-shot Learning, Open-Vocabulary

Yihong Xu
Research Scientist @ valeo.ai; PhD from RobotLearn, Inria Grenoble
Computer Vision, Deep Learning, Motion Forecasting, Multiple-Object Tracking, Domain Adaptation

Loïck Chambon
PhD Student, Sorbonne University & valeo.ai
Computer Vision

Spyros Gidaris
Senior research scientist at valeo.ai
Deep Learning, Computer Vision

Serkan Odabas
valeo.ai, Paris, France

David Hurych
Research Scientist at valeo.ai
Computer vision, Machine learning, Artificial intelligence

Renaud Marlet
Senior researcher at ENPC / Principal Scientist at valeo.ai
Computer Vision, Scene Understanding, 3D, Geometry Processing

Alexandre Boulch
Senior researcher at valeo.ai
Computer science, Computational geometry, Computer vision

Mickael Chen
H Company
Generative Models

Éloi Zablocki
valeo.ai, Paris, France

Andrei Bursuc
valeo.ai
Computer vision, Machine learning, Self-supervised learning, Uncertainty estimation, Visual search

Eduardo Valle
valeo.ai
Machine Learning, Computer Vision, Health, Education

Matthieu Cord
Professor Sorbonne University / Scientific Director valeo.ai
Computer Vision, Image Processing, Machine Learning, Artificial Intelligence, Deep Learning