EvoDriveVLA: Evolving Autonomous Driving Vision-Language-Action Model via Collaborative Perception-Planning Distillation

πŸ“… 2026-03-10
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ€– AI Summary
This work addresses the perceptual degradation and the cumulative instability in long-horizon planning that arise when the visual encoder of a vision-language-based autonomous driving model is unfrozen. To mitigate these issues, the authors propose a collaborative perception-planning distillation framework: a self-anchored visual distillation mechanism strengthens perception of critical regions, while a future-aware "oracle" teacher model uses trajectory-guided attention and a coarse-to-fine distillation strategy to refine predicted trajectories. Monte Carlo Dropout sampling is further integrated to improve uncertainty modeling. The method achieves state-of-the-art performance in open-loop evaluation and significantly improves closed-loop driving outcomes.
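The Monte Carlo Dropout step is the most self-contained piece of the pipeline. Below is a minimal PyTorch sketch of how dropout-based trajectory sampling and oracle candidate selection could look; `planner`, the tensor shapes, and the L2 selection rule are illustrative assumptions, not the paper's actual interfaces.

```python
import torch
import torch.nn as nn

def mc_dropout_trajectories(planner: nn.Module, obs: torch.Tensor,
                            num_samples: int = 8) -> torch.Tensor:
    """Sample trajectory candidates by keeping dropout active at inference."""
    planner.eval()
    # Re-enable only the dropout layers; normalization layers stay in eval mode.
    for m in planner.modules():
        if isinstance(m, nn.Dropout):
            m.train()
    with torch.no_grad():
        # Each forward pass uses a different dropout mask, yielding a
        # diverse set of candidate waypoint sequences.
        samples = torch.stack([planner(obs) for _ in range(num_samples)])
    return samples  # hypothetical shape: [num_samples, T, 2]

def select_oracle_trajectory(samples: torch.Tensor,
                             gt_future: torch.Tensor) -> torch.Tensor:
    """Training-time selection of the candidate closest to the ground-truth
    future, standing in for the paper's optimal-trajectory selection."""
    # Mean L2 distance between each candidate and the ground-truth waypoints.
    dists = (samples - gt_future.unsqueeze(0)).norm(dim=-1).mean(dim=-1)
    return samples[dists.argmin()]
```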

πŸ“ Abstract
Vision-Language-Action models have shown great promise for autonomous driving, yet they suffer from degraded perception after unfreezing the visual encoder and struggle with accumulated instability in long-term planning. To address these challenges, we propose EvoDriveVLA, a novel collaborative perception-planning distillation framework that integrates self-anchored perceptual constraints and oracle-guided trajectory optimization. Specifically, self-anchored visual distillation leverages a self-anchor teacher to deliver visual anchoring constraints, regularizing student representations via trajectory-guided key-region awareness. In parallel, oracle-guided trajectory distillation employs a future-aware oracle teacher with coarse-to-fine trajectory refinement and Monte Carlo dropout sampling to produce high-quality trajectory candidates, from which the optimal trajectory is selected to guide the student's prediction. EvoDriveVLA achieves state-of-the-art performance in open-loop evaluation and significantly improves performance in closed-loop evaluation. Our code is available at: https://github.com/hey-cjj/EvoDriveVLA.
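As a rough illustration of how the two distillation signals in the abstract might combine during training, here is a hedged PyTorch sketch; all tensor names, the key-region weighting, and the loss weights `alpha`/`beta` are assumptions rather than the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def collaborative_distillation_loss(student_feats: torch.Tensor,
                                    teacher_feats: torch.Tensor,
                                    key_region_mask: torch.Tensor,
                                    student_traj: torch.Tensor,
                                    oracle_traj: torch.Tensor,
                                    alpha: float = 1.0,
                                    beta: float = 1.0) -> torch.Tensor:
    """Combine the two distillation signals described in the abstract.

    Assumed (illustrative) shapes:
      student_feats, teacher_feats: [B, N, C] visual tokens
      key_region_mask:              [B, N] trajectory-guided region weights
      student_traj, oracle_traj:    [B, T, 2] waypoint sequences
    """
    # Self-anchored visual distillation: align student tokens with the
    # anchor teacher, up-weighting trajectory-relevant regions.
    feat_err = F.mse_loss(student_feats, teacher_feats, reduction="none").mean(-1)
    vis_loss = (feat_err * key_region_mask).sum() / key_region_mask.sum().clamp(min=1e-6)

    # Oracle-guided trajectory distillation: imitate the selected oracle
    # trajectory with a plain L1 waypoint loss.
    traj_loss = F.l1_loss(student_traj, oracle_traj)

    return alpha * vis_loss + beta * traj_loss
```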
Problem

Research questions and friction points this paper is trying to address.

Vision-Language-Action
autonomous driving
perception degradation
planning instability
Innovation

Methods, ideas, or system contributions that make the work stand out.

Vision-Language-Action
Collaborative Distillation
Self-Anchored Perception
Oracle-Guided Trajectory
Autonomous Driving
Jiajun Cao
Ph.D. Student, Peking University
MLLM, Computer Vision
Xiaoan Zhang
State Key Laboratory of Multimedia Information Processing, School of Computer Science, Peking University; XPeng Motors
Xiaobao Wei
Institute of Software, Chinese Academy of Sciences
3D Vision
Liyuqiu Huang
State Key Laboratory of Multimedia Information Processing, School of Computer Science, Peking University; XPeng Motors
Wang Zijian
XPeng Motors
Hanzhen Zhang
XPeng Motors
Zhengyu Jia
XPeng Motors
Wei Mao
XPeng Motors
Hao Wang
Peking University
AI4Science, Embodied AI, Machine Learning
Xianming Liu
XPeng Motors
Shuchang Zhou
Megvii Inc.
Artificial Intelligence
Yang Wang
XPeng Motors
Shanghang Zhang
Peking University
Embodied AI, Foundation Models