🤖 AI Summary
Current end-to-end autonomous driving models merely fit the driving behaviors observed in data, without explicitly modeling the underlying commonsense reasoning, which leads to poor generalization in complex scenarios. This work proposes a vision-language model (VLM) distillation framework: during training, off-the-shelf VLMs (e.g., Qwen-VL, LLaVA) generate multimodal supervision signals, including structured action labels and natural-language reasoning annotations, that guide lightweight end-to-end driving models (e.g., TransFuser, UniAD) toward commonsense-aware decision-making; at inference, the VLM is excluded entirely, preserving real-time performance. To our knowledge, this is the first approach to use VLMs purely as a supervision source, rather than as deployable components, in end-to-end driving training, thereby decoupling inference from computationally intensive reasoning models. On nuScenes, the method significantly improves planning accuracy and reduces the collision rate by 18.7%, while showing improved robustness under unseen weather conditions and in interactive scenarios.
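The training/inference split described above can be illustrated with a toy sketch. This is a minimal illustration only, assuming the distillation takes the form of auxiliary heads trained against VLM-generated labels on a shared feature; all class names, weights, and losses here are hypothetical stand-ins, not the paper's actual architecture.

```python
# Toy sketch of the training-vs-inference split: auxiliary heads absorb
# VLM supervision during training, and the inference path never touches
# them (or the VLM). Everything here is an illustrative stand-in.

class TinyPlanner:
    """End-to-end planner: one shared scalar feature, a planning head,
    and two auxiliary heads used only to receive VLM supervision."""

    def __init__(self):
        self.w_feat, self.w_plan = 0.8, 1.2      # shared backbone + main head
        self.w_action, self.w_reason = 0.5, 0.5  # training-only auxiliary heads

    def training_loss(self, x, traj_gt, vlm_action, vlm_reason):
        f = self.w_feat * x
        # Main imitation loss plus two distillation terms supervised by
        # VLM-generated action labels and reasoning annotations; gradients
        # from all three terms flow into the shared feature.
        return ((self.w_plan * f - traj_gt) ** 2
                + (self.w_action * f - vlm_action) ** 2
                + (self.w_reason * f - vlm_reason) ** 2)

    def infer(self, x):
        # Inference runs only backbone + planning head: no VLM call and
        # no auxiliary heads appear anywhere in this path.
        return self.w_plan * (self.w_feat * x)


model = TinyPlanner()
loss = model.training_loss(x=1.0, traj_gt=1.0, vlm_action=0.4, vlm_reason=0.6)
plan = model.infer(x=1.0)  # real-time path is independent of the VLM
```

The design point is that the auxiliary heads and the VLM exist only on the training side of the graph, so the deployed model pays no inference cost for the distilled reasoning.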
📝 Abstract
Human drivers rely on commonsense reasoning to navigate diverse and dynamic real-world scenarios. Existing end-to-end (E2E) autonomous driving (AD) models are typically optimized to mimic driving patterns observed in data, without capturing the underlying reasoning processes. This limitation constrains their ability to handle challenging driving scenarios. To close this gap, we propose VLM-AD, a method that leverages vision-language models (VLMs) as teachers to enhance training by providing additional supervision that incorporates unstructured reasoning information and structured action labels. Such supervision enhances the model's ability to learn richer feature representations that capture the rationale behind driving patterns. Importantly, our method does not require a VLM during inference, making it practical for real-time deployment. When integrated with state-of-the-art methods, VLM-AD achieves significant improvements in planning accuracy and reduced collision rates on the nuScenes dataset.