Enhancing End-to-End Autonomous Driving with Risk Semantic Distillation from VLM

📅 2025-11-18
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current end-to-end autonomous driving systems suffer from limited generalization under unseen scenarios and heterogeneous sensor configurations: VLM-based approaches introduce semantic-action inconsistency, while pure VLA methods incur prohibitive computational overhead. To address this, we propose Risk Semantic Distillation (RSD), the first framework to distill causal risk semantics, extracted from vision-language models, into the bird's-eye-view (BEV) feature space in an interpretable manner. RSD employs a lightweight RiskHead module to generate spatially fine-grained risk-attention maps, enabling explicit modeling of the risk associated with critical objects and scene boundaries. By integrating human driving priors with end-to-end learning, RSD requires neither additional annotations nor online VLM inference. Evaluated on the Bench2Drive benchmark, RSD achieves significant improvements in both perception and planning performance, demonstrating superior robustness in complex dynamic environments and enhanced cross-configuration generalization.

📝 Abstract
Autonomous driving (AD) systems have exhibited remarkable performance in complex driving scenarios. However, generalization, the ability to handle unseen scenarios or unfamiliar sensor configurations, remains a key limitation of current systems. Related work has explored the use of Vision-Language Models (VLMs) to address few-shot or zero-shot tasks. While promising, these methods introduce a new challenge: the emergence of a hybrid AD system in which two distinct subsystems plan a trajectory, leading to potential inconsistencies. Alternative research directions have explored Vision-Language-Action (VLA) frameworks that generate control actions directly from a VLM; however, these end-to-end solutions demonstrate prohibitive computational demands. To overcome these challenges, we introduce Risk Semantic Distillation (RSD), a novel framework that leverages VLMs to enhance the training of End-to-End (E2E) AD backbones. By providing risk attention for key objects, RSD addresses the issue of generalization. Specifically, we introduce RiskHead, a plug-in module that distills causal risk estimates from VLMs into Bird's-Eye-View (BEV) features, yielding interpretable risk-attention maps. This approach allows BEV features to learn richer and more nuanced risk-attention representations, which directly enhance the model's ability to handle spatial boundaries and risky objects. By focusing on risk attention, RSD aligns more closely with human-like driving behavior, which is essential for navigating complex and dynamic environments. Our experiments on the Bench2Drive benchmark demonstrate the effectiveness of RSD in managing complex and unpredictable driving conditions. Owing to the enhanced BEV representations enabled by RSD, we observe significant improvements in both perception and planning capabilities.
Problem

Research questions and friction points this paper is trying to address.

Improves generalization in autonomous driving systems
Reduces computational demands of end-to-end vision-language-action frameworks
Enhances risk attention for spatial boundaries and objects
Innovation

Methods, ideas, or system contributions that make the work stand out.

Distills risk estimates from Vision-Language Models
Enhances Bird's-Eye-View features with risk attention
Uses plug-in RiskHead module for interpretable risk maps
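The distillation idea above can be sketched in code. The paper does not publish RiskHead's architecture or training loss, so everything below is an illustrative assumption: a lightweight per-cell head that maps BEV features to a risk-attention map, trained to match risk targets derived offline from a VLM (so no online VLM inference is needed at deployment). The 1x1 projection and the binary cross-entropy loss are hypothetical choices for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative shapes: BEV feature channels and grid size (not from the paper).
C, H, W = 16, 8, 8
bev = rng.normal(size=(C, H, W))        # BEV features from the E2E backbone
vlm_risk = rng.uniform(size=(H, W))     # offline VLM-derived risk targets in [0, 1]

# Hypothetical RiskHead: a 1x1 "conv", i.e. one linear projection per BEV cell.
w = rng.normal(scale=0.1, size=(C,))
b = 0.0

def risk_head(feat, w, b):
    """Project each BEV cell's feature vector to a risk logit, then squash
    with a sigmoid to get a risk-attention map in [0, 1]."""
    logits = np.einsum("c,chw->hw", w, feat) + b
    return 1.0 / (1.0 + np.exp(-logits))

def distill_loss(pred, target):
    """Binary cross-entropy between the predicted risk map and the
    VLM-derived target map (the distillation signal)."""
    eps = 1e-7
    p = np.clip(pred, eps, 1.0 - eps)
    return float(-(target * np.log(p) + (1.0 - target) * np.log(1.0 - p)).mean())

pred = risk_head(bev, w, b)       # (H, W) risk-attention map
loss = distill_loss(pred, vlm_risk)
```

In a full system this loss would be added to the perception and planning objectives during training, and the risk map would modulate the BEV features; those integration details are not specified in this summary.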