LatentVLA: Efficient Vision-Language Models for Autonomous Driving via Latent Action Prediction

📅 2026-01-09
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the limited generalization of end-to-end autonomous driving models in long-tail, rare scenarios, where existing vision-language-action (VLA) approaches suffer from low trajectory prediction accuracy, semantic bias introduced by language annotations, and insufficient inference efficiency. To overcome these limitations, the authors propose LatentVLA, a framework that leverages self-supervised latent action prediction to learn driving representations from unlabeled trajectory data, thereby eliminating reliance on language annotations and associated semantic biases. Furthermore, knowledge distillation is employed to transfer the generalization capability of the VLA model to a lightweight visual network, achieving a favorable trade-off between performance and real-time inference. The method sets a new state-of-the-art with a 92.4 PDMS score on NAVSIM and demonstrates exceptional zero-shot generalization on nuScenes.

📝 Abstract
End-to-end autonomous driving models trained on large-scale datasets perform well in common scenarios but struggle with rare, long-tail situations due to limited scenario diversity. Recent Vision-Language-Action (VLA) models leverage broad knowledge from pre-trained vision-language models to address this limitation, yet face critical challenges: (1) numerical imprecision in trajectory prediction due to discrete tokenization, (2) heavy reliance on language annotations that introduce linguistic bias and annotation burden, and (3) computational inefficiency from multi-step chain-of-thought reasoning that hinders real-time deployment. We propose LatentVLA, a novel framework that employs self-supervised latent action prediction to train VLA models without language annotations, eliminating linguistic bias while learning rich driving representations from unlabeled trajectory data. Through knowledge distillation, LatentVLA transfers the generalization capabilities of VLA models to efficient vision-based networks, achieving both robust performance and real-time efficiency. LatentVLA establishes a new state-of-the-art on the NAVSIM benchmark with a PDMS score of 92.4 and demonstrates strong zero-shot generalization on the nuScenes benchmark.
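The abstract describes two training signals: continuous trajectory regression (avoiding discrete-token imprecision) and distillation of the VLA teacher's latent actions into a lightweight vision network. A minimal sketch of how such a combined objective could look, assuming hypothetical tensor shapes, loss forms, and weights (the paper's actual losses are not reproduced here):

```python
import numpy as np

# Hypothetical sketch of the two training signals described in the abstract:
# (1) continuous trajectory regression on unlabeled driving logs, and
# (2) distillation of teacher VLA latent actions into a lightweight student.
# All names, shapes, and loss weights are illustrative assumptions.

rng = np.random.default_rng(0)

def mse(a, b):
    """Mean squared error over all elements."""
    return float(np.mean((a - b) ** 2))

# Teacher VLA encodes observations into latent action vectors (no language labels).
teacher_latent = rng.normal(size=(4, 64))
# Lightweight student vision network predicts the same latents (distillation target).
student_latent = teacher_latent + 0.1 * rng.normal(size=(4, 64))

# Continuous waypoint regression: 8 future (x, y) waypoints per sample,
# supervised directly by recorded trajectories rather than discrete tokens.
gt_traj = rng.normal(size=(4, 8, 2))
pred_traj = gt_traj + 0.05 * rng.normal(size=(4, 8, 2))

alpha = 0.5  # illustrative weight on the distillation term
loss = mse(pred_traj, gt_traj) + alpha * mse(student_latent, teacher_latent)
print(f"total loss: {loss:.4f}")
```

At inference time only the distilled student would run, which is how a setup like this trades the teacher's capacity for real-time efficiency.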
Problem

Research questions and friction points this paper is trying to address.

autonomous driving
vision-language models
long-tail scenarios
trajectory prediction
real-time deployment
Innovation

Methods, ideas, or system contributions that make the work stand out.

Latent Action Prediction
Vision-Language-Action (VLA)
Self-supervised Learning
Knowledge Distillation
Autonomous Driving
Chengen Xie
Shanghai Innovation Institute; OpenDriveLab at The University of Hong Kong
Bin Sun
Li Auto Inc.
Tianyu Li
Fudan University | OpenDriveLab
motion planning · autonomous driving · computer vision
Junjie Wu
Center for High Pressure Science & Technology Advanced Research
Physics
Zhihui Hao
Li Auto Inc.
Xianpeng Lang
Li Auto Inc.
Hongyang Li
Assistant Professor, University of Hong Kong
Computer Vision · Autonomous Driving · Robotics