🤖 AI Summary
Current end-to-end autonomous driving models merely fit the driving behaviors observed in data, without explicitly modeling the underlying commonsense reasoning, which leads to poor generalization in complex scenarios. This work proposes a vision-language model (VLM) distillation framework: during training, off-the-shelf VLMs (e.g., Qwen-VL, LLaVA) generate multimodal supervision signals, including structured action labels and natural-language reasoning annotations, that guide lightweight end-to-end driving models (e.g., TransFuser, UniAD) toward commonsense-aware decision-making; at inference, the VLM is excluded entirely, preserving real-time performance. To our knowledge, this is the first approach to use VLMs purely as a supervision source, rather than as deployable components, in end-to-end driving training, thereby decoupling inference from computationally intensive reasoning models. On nuScenes, the method significantly improves planning accuracy and reduces the collision rate by 18.7%, while showing improved robustness under unseen weather conditions and in interactive scenarios.
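The training/inference split described above can be illustrated with a toy sketch. This is a minimal illustration only, assuming the distillation takes the form of auxiliary heads trained against VLM-generated labels on a shared feature; all class names, weights, and losses here are hypothetical stand-ins, not the paper's actual architecture.

```python
# Toy sketch of the training-vs-inference split: auxiliary heads absorb
# VLM supervision during training, and the inference path never touches
# them (or the VLM). Everything here is an illustrative stand-in.

class TinyPlanner:
    """End-to-end planner: one shared scalar feature, a planning head,
    and two auxiliary heads used only to receive VLM supervision."""

    def __init__(self):
        self.w_feat, self.w_plan = 0.8, 1.2      # shared backbone + main head
        self.w_action, self.w_reason = 0.5, 0.5  # training-only auxiliary heads

    def training_loss(self, x, traj_gt, vlm_action, vlm_reason):
        f = self.w_feat * x
        # Main imitation loss plus two distillation terms supervised by
        # VLM-generated action labels and reasoning annotations; gradients
        # from all three terms flow into the shared feature.
        return ((self.w_plan * f - traj_gt) ** 2
                + (self.w_action * f - vlm_action) ** 2
                + (self.w_reason * f - vlm_reason) ** 2)

    def infer(self, x):
        # Inference runs only backbone + planning head: no VLM call and
        # no auxiliary heads appear anywhere in this path.
        return self.w_plan * (self.w_feat * x)


model = TinyPlanner()
loss = model.training_loss(x=1.0, traj_gt=1.0, vlm_action=0.4, vlm_reason=0.6)
plan = model.infer(x=1.0)  # real-time path is independent of the VLM
```

The design point is that the auxiliary heads and the VLM exist only on the training side of the graph, so the deployed model pays no inference cost for the distilled reasoning.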
📝 Abstract
Human drivers rely on commonsense reasoning to navigate diverse and dynamic real-world scenarios. Existing end-to-end (E2E) autonomous driving (AD) models are typically optimized to mimic driving patterns observed in data, without capturing the underlying reasoning processes. This limitation constrains their ability to handle challenging driving scenarios. To close this gap, we propose VLM-AD, a method that leverages vision-language models (VLMs) as teachers to enhance training by providing additional supervision that incorporates unstructured reasoning information and structured action labels. Such supervision enhances the model's ability to learn richer feature representations that capture the rationale behind driving patterns. Importantly, our method does not require a VLM during inference, making it practical for real-time deployment. When integrated with state-of-the-art methods, VLM-AD achieves significant improvements in planning accuracy and reduced collision rates on the nuScenes dataset.