🤖 AI Summary
Existing end-to-end autonomous driving systems suffer from poor interpretability, a lack of formal safety guarantees, and the limited robustness of vision-language-guided reinforcement learning in dynamic environments. DriveMind addresses these challenges with a semantic reward framework that dynamically aligns perception with language-specified goals while enforcing hierarchical kinematic constraints on speed, lane adherence, and motion stability. Its key contributions are: (i) the first dual vision-language-model (VLM) architecture, pairing a contrastive VLM for stepwise semantic anchoring with a novelty-triggered chain-of-thought distillation mechanism in a VLM encoder-decoder that generates adaptive prompts under semantic drift; and (ii) the integration of a compact predictive world model with a hierarchical safety module. Evaluated on CARLA Town 2, DriveMind achieves an average speed of 19.4 ± 2.3 km/h, a route completion rate of 0.98 ± 0.03, and near-zero collisions, outperforming baselines by over 4% in success rate. It further demonstrates zero-shot transfer to real-world driving footage.
📝 Abstract
End-to-end autonomous driving systems map sensor data directly to control commands, but remain opaque, lack interpretability, and offer no formal safety guarantees. While recent vision-language-guided reinforcement learning (RL) methods introduce semantic feedback, they often rely on static prompts and fixed objectives, limiting adaptability to dynamic driving scenes. We present DriveMind, a unified semantic reward framework that integrates: (i) a contrastive Vision-Language Model (VLM) encoder for stepwise semantic anchoring; (ii) a novelty-triggered VLM encoder-decoder, fine-tuned via chain-of-thought (CoT) distillation, for dynamic prompt generation upon semantic drift; (iii) a hierarchical safety module enforcing kinematic constraints (e.g., speed, lane centering, stability); and (iv) a compact predictive world model to reward alignment with anticipated ideal states. DriveMind achieves 19.4 ± 2.3 km/h average speed, 0.98 ± 0.03 route completion, and near-zero collisions in CARLA Town 2, outperforming baselines by over 4% in success rate. Its semantic reward generalizes zero-shot to real dash-cam data with minimal distributional shift, demonstrating robust cross-domain alignment and potential for real-world deployment.
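To make the reward composition concrete, here is a minimal sketch of how a contrastive semantic score (image embedding vs. positive/negative prompt embeddings) might be gated by hierarchical kinematic constraints. All function names, thresholds, and the specific gating rule are illustrative assumptions for this sketch, not DriveMind's actual implementation; a real system would use VLM embeddings (e.g., CLIP-style) rather than toy vectors.

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def semantic_reward(img_emb: np.ndarray,
                    pos_emb: np.ndarray,
                    neg_emb: np.ndarray,
                    speed_mps: float,
                    lane_offset_m: float,
                    v_max_mps: float = 8.3,      # assumed speed cap (~30 km/h)
                    lane_tol_m: float = 0.5) -> float:
    """Sketch: contrastive semantic score gated by kinematic safety checks.

    The semantic term rewards closeness to the positive prompt embedding
    and distance from the negative one; the hierarchical gate overrides
    semantics whenever a kinematic constraint is violated.
    """
    sem = cosine(img_emb, pos_emb) - cosine(img_emb, neg_emb)
    safe = (0.0 <= speed_mps <= v_max_mps) and (abs(lane_offset_m) <= lane_tol_m)
    if safe:
        return sem
    # Unsafe state: clip away any positive semantic reward and add a penalty.
    return min(sem, 0.0) - 1.0
```

In this toy form, an observation embedding aligned with the positive prompt while driving within the speed and lane limits yields a positive reward, whereas the same observation at an unsafe speed is penalized regardless of its semantic score, illustrating why the safety module sits above the semantic term in the hierarchy.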