🤖 AI Summary
Existing end-to-end autonomous driving systems suffer from poor interpretability, a lack of formal safety guarantees, and the limited robustness of vision-language-guided reinforcement learning in dynamic environments. DriveMind addresses these challenges with a semantic reward framework that dynamically aligns perception with language-specified goals while enforcing hierarchical kinematic constraints on speed, lane adherence, and motion stability. Its key contributions are: (i) the first dual vision-language-model (VLM) architecture, pairing a contrastive VLM for stepwise semantic anchoring with a novelty-triggered chain-of-thought distillation mechanism in a VLM encoder-decoder that generates adaptive prompts under semantic drift; and (ii) the integration of a compact predictive world model with a hierarchical safety module. Evaluated on CARLA Town 2, DriveMind achieves an average speed of 19.4 ± 2.3 km/h, a route completion rate of 0.98 ± 0.03, and near-zero collisions, outperforming baselines by over 4% in success rate. It further demonstrates zero-shot transfer to real-world driving footage.
📝 Abstract
End-to-end autonomous driving systems map sensor data directly to control commands, but remain opaque, lack interpretability, and offer no formal safety guarantees. While recent vision-language-guided reinforcement learning (RL) methods introduce semantic feedback, they often rely on static prompts and fixed objectives, limiting adaptability to dynamic driving scenes. We present DriveMind, a unified semantic reward framework that integrates: (i) a contrastive Vision-Language Model (VLM) encoder for stepwise semantic anchoring; (ii) a novelty-triggered VLM encoder-decoder, fine-tuned via chain-of-thought (CoT) distillation, for dynamic prompt generation upon semantic drift; (iii) a hierarchical safety module enforcing kinematic constraints (e.g., speed, lane centering, stability); and (iv) a compact predictive world model to reward alignment with anticipated ideal states. DriveMind achieves 19.4 ± 2.3 km/h average speed, 0.98 ± 0.03 route completion, and near-zero collisions in CARLA Town 2, outperforming baselines by over 4% in success rate. Its semantic reward generalizes zero-shot to real dash-cam data with minimal distributional shift, demonstrating robust cross-domain alignment and potential for real-world deployment.
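To make the reward composition concrete, here is a minimal sketch of how a contrastive semantic score (image embedding vs. positive/negative prompt embeddings) might be gated by hierarchical kinematic constraints. All function names, thresholds, and the specific gating rule are illustrative assumptions for this sketch, not DriveMind's actual implementation; a real system would use VLM embeddings (e.g., CLIP-style) rather than toy vectors.

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def semantic_reward(img_emb: np.ndarray,
                    pos_emb: np.ndarray,
                    neg_emb: np.ndarray,
                    speed_mps: float,
                    lane_offset_m: float,
                    v_max_mps: float = 8.3,      # assumed speed cap (~30 km/h)
                    lane_tol_m: float = 0.5) -> float:
    """Sketch: contrastive semantic score gated by kinematic safety checks.

    The semantic term rewards closeness to the positive prompt embedding
    and distance from the negative one; the hierarchical gate overrides
    semantics whenever a kinematic constraint is violated.
    """
    sem = cosine(img_emb, pos_emb) - cosine(img_emb, neg_emb)
    safe = (0.0 <= speed_mps <= v_max_mps) and (abs(lane_offset_m) <= lane_tol_m)
    if safe:
        return sem
    # Unsafe state: clip away any positive semantic reward and add a penalty.
    return min(sem, 0.0) - 1.0
```

In this toy form, an observation embedding aligned with the positive prompt while driving within the speed and lane limits yields a positive reward, whereas the same observation at an unsafe speed is penalized regardless of its semantic score, illustrating why the safety module sits above the semantic term in the hierarchy.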