AI Summary
This work addresses the perceptual degradation and cumulative instability in long-horizon planning that arise when visual encoders are unfrozen in vision-language-based autonomous driving models. To mitigate these issues, the authors propose a collaborative perception-planning distillation framework. The approach introduces a self-anchored visual distillation mechanism to enhance robustness in perceiving critical regions and designs a future-aware "oracle" teacher model that leverages trajectory-guided attention and a coarse-to-fine distillation strategy to refine predicted trajectories. Furthermore, Monte Carlo dropout sampling is integrated to improve uncertainty modeling. Evaluated in open-loop settings, the method achieves state-of-the-art performance and significantly enhances closed-loop driving outcomes.
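The trajectory-guided attention idea can be illustrated with a minimal sketch: waypoints projected into the camera view define a soft spatial prior that upweights nearby feature-map regions. Everything here (the `trajectory_attention` name, the Gaussian weighting, the assumption that waypoints are already in pixel coordinates) is an illustrative stand-in, not the paper's actual implementation.

```python
import numpy as np

def trajectory_attention(feature_map, waypoints_px, sigma=8.0):
    """Weight an (H, W, C) feature map by proximity to trajectory waypoints.

    waypoints_px: (N, 2) array of (row, col) pixel coordinates, assumed to be
    the ego trajectory already projected into the camera view.
    """
    H, W, _ = feature_map.shape
    rows, cols = np.mgrid[0:H, 0:W]
    attn = np.zeros((H, W))
    for r, c in waypoints_px:
        # Gaussian bump centered on each waypoint; keep the max over waypoints.
        bump = np.exp(-((rows - r) ** 2 + (cols - c) ** 2) / (2 * sigma ** 2))
        attn = np.maximum(attn, bump)
    # Broadcast the spatial prior over channels.
    return feature_map * attn[..., None], attn

# Toy usage: one waypoint at the center of a 32x32 map with 4 channels.
fm = np.ones((32, 32, 4))
wp = np.array([[16, 16]])
weighted, attn = trajectory_attention(fm, wp)
```

Regions far from the projected trajectory are attenuated smoothly rather than masked out, so the student still sees global context while the key regions dominate the distillation signal.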
Abstract
Vision-Language-Action models have shown great promise for autonomous driving, yet they suffer from degraded perception after unfreezing the visual encoder and struggle with accumulated instability in long-term planning. To address these challenges, we propose EvoDriveVLA, a novel collaborative perception-planning distillation framework that integrates self-anchored perceptual constraints and oracle-guided trajectory optimization. Specifically, self-anchored visual distillation leverages a self-anchor teacher to deliver visual anchoring constraints, regularizing student representations via trajectory-guided key-region awareness. In parallel, oracle-guided trajectory distillation employs a future-aware oracle teacher with coarse-to-fine trajectory refinement and Monte Carlo dropout sampling to produce high-quality trajectory candidates, from which the optimal trajectory is selected to guide the student's prediction. EvoDriveVLA achieves SOTA performance in open-loop evaluation and significantly improves performance in closed-loop evaluation. Our code is available at: https://github.com/hey-cjj/EvoDriveVLA.
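The Monte Carlo dropout sampling and candidate-selection step can be sketched as follows. This is a minimal illustration, assuming a toy linear trajectory head with inference-time dropout, and a reference-trajectory scoring rule standing in for the oracle teacher's future-aware selection; all names and shapes are illustrative, not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def trajectory_head(features, weights, drop_rate=0.2):
    # MC Dropout: keep dropout active at inference time so each forward
    # pass yields a different trajectory hypothesis.
    mask = rng.random(features.shape) > drop_rate
    h = features * mask / (1.0 - drop_rate)  # inverted-dropout scaling
    return h @ weights  # flattened waypoint trajectory

def mc_dropout_candidates(features, weights, n_samples=8):
    # Sample a population of trajectory candidates via repeated stochastic passes.
    return [trajectory_head(features, weights) for _ in range(n_samples)]

def select_best(candidates, reference):
    # Oracle stand-in: score each candidate against a reference trajectory
    # (e.g. the recorded future) and keep the closest one.
    errors = [float(np.mean((c - reference) ** 2)) for c in candidates]
    best = int(np.argmin(errors))
    return candidates[best], errors[best]

# Toy usage: 16-d features, 4 waypoints x 2 coordinates flattened to 8 values.
features = rng.standard_normal(16)
weights = rng.standard_normal((16, 8))
reference = np.zeros(8)
cands = mc_dropout_candidates(features, weights)
best_traj, err = select_best(cands, reference)
```

The spread of the sampled candidates doubles as a cheap uncertainty estimate: a wide spread flags situations where the planner's prediction should be trusted less.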