DriveVLM-RL: Neuroscience-Inspired Reinforcement Learning with Vision-Language Models for Safe and Deployable Autonomous Driving

📅 2026-03-18
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses a core challenge in autonomous driving: conventional reinforcement learning relies on handcrafted rewards or sparse collision signals and therefore struggles to balance safe exploration with contextual awareness. Inspired by neuroscience, the authors propose an asynchronous dual-pathway architecture: during offline training, a vision-language model (VLM) integrates CLIP-based semantic goals, a lightweight detector, and an attention-gating mechanism to generate hierarchical rewards, assessing spatial safety via a static pathway and reasoning over multi-frame semantic risks via a dynamic pathway. At deployment, the VLM is removed, decoupling semantic understanding from real-time control. Evaluated in CARLA simulations, the method significantly improves collision avoidance, task success rates, and cross-scenario generalization, and it remains robust even without explicit collision penalties.
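The summary above describes fusing the two pathway signals with vehicle state into one hierarchical reward. A minimal sketch of that fusion step is shown below; the function name, the weights, and the choice of a speed-tracking state term are illustrative assumptions, not the paper's actual formulation.

```python
# Illustrative sketch of hierarchical reward synthesis. All names and
# weights here are hypothetical stand-ins for the paper's mechanism.

def synthesize_reward(static_safety: float,
                      dynamic_risk: float,
                      speed_tracking: float,
                      w_static: float = 0.5,
                      w_dynamic: float = 0.3,
                      w_state: float = 0.2) -> float:
    """Fuse pathway signals and a vehicle-state term into one scalar reward.

    static_safety  -- [0, 1] spatial safety score from the static pathway
    dynamic_risk   -- [0, 1] multi-frame semantic risk from the dynamic pathway
    speed_tracking -- [0, 1] how well the ego vehicle tracks its target speed
    """
    # Semantic risk is a penalty, so it enters with a negative sign.
    return (w_static * static_safety
            - w_dynamic * dynamic_risk
            + w_state * speed_tracking)
```

With these toy weights, a fully safe, on-speed state scores 0.7, while a maximally risky state scores -0.3, giving the policy a dense signal even when no collision occurs.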

📝 Abstract
Ensuring safe decision-making in autonomous vehicles remains a fundamental challenge despite rapid advances in end-to-end learning approaches. Traditional reinforcement learning (RL) methods rely on manually engineered rewards or sparse collision signals, which fail to capture the rich contextual understanding required for safe driving and make unsafe exploration unavoidable in real-world settings. Recent vision-language models (VLMs) offer promising semantic understanding capabilities; however, their high inference latency and susceptibility to hallucination hinder direct application to real-time vehicle control. To address these limitations, this paper proposes DriveVLM-RL, a neuroscience-inspired framework that integrates VLMs into RL through a dual-pathway architecture for safe and deployable autonomous driving. The framework decomposes semantic reward learning into a Static Pathway for continuous spatial safety assessment using CLIP-based contrastive language goals, and a Dynamic Pathway for attention-gated multi-frame semantic risk reasoning using a lightweight detector and a large VLM. A hierarchical reward synthesis mechanism fuses semantic signals with vehicle states, while an asynchronous training pipeline decouples expensive VLM inference from environment interaction. All VLM components are used only during offline training and are removed at deployment, ensuring real-time feasibility. Experiments in the CARLA simulator show significant improvements in collision avoidance, task success, and generalization across diverse traffic scenarios, including strong robustness under settings without explicit collision penalties. These results demonstrate that DriveVLM-RL provides a practical paradigm for integrating foundation models into autonomous driving without compromising real-time feasibility. Demo video and code are available at: https://zilin-huang.github.io/DriveVLM-RL-website/
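The Static Pathway's CLIP-based language-goal scoring can be pictured as a contrastive softmax between a "safe" and an "unsafe" text goal. The sketch below is a toy version of that scoring logic only: real CLIP image/text encoders are replaced by stand-in embedding vectors, and the prompt pairing, temperature, and function names are assumptions rather than the paper's implementation.

```python
import math

# Toy sketch of contrastive language-goal scoring. In a CLIP-style setup,
# image_emb, safe_emb, and unsafe_emb would come from frozen image/text
# encoders; here they are plain vectors so the scoring itself is runnable.

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def language_goal_reward(image_emb, safe_emb, unsafe_emb, temperature=0.07):
    """Softmax over similarities to a 'safe' vs. an 'unsafe' text goal.

    Returns the probability mass on the safe goal, usable as a dense
    spatial-safety reward in [0, 1].
    """
    s_safe = cosine(image_emb, safe_emb) / temperature
    s_unsafe = cosine(image_emb, unsafe_emb) / temperature
    m = max(s_safe, s_unsafe)  # subtract the max for numerical stability
    e_safe = math.exp(s_safe - m)
    e_unsafe = math.exp(s_unsafe - m)
    return e_safe / (e_safe + e_unsafe)
```

A frame whose embedding aligns with the safe-goal text (e.g. "the ego vehicle keeps a clear lane") yields a reward near 1; alignment with the unsafe goal pushes it toward 0, giving a continuous signal without any collision event.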
Problem

Research questions and friction points this paper is trying to address.

autonomous driving
reinforcement learning
vision-language models
safe decision-making
real-time control
Innovation

Methods, ideas, or system contributions that make the work stand out.

Vision-Language Models
Reinforcement Learning
Dual-Pathway Architecture
Asynchronous Training
Safe Autonomous Driving
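The "Asynchronous Training" contribution above, where slow VLM reward inference is decoupled from fast environment interaction, follows a standard producer/consumer pattern. The following is a minimal sketch of that pattern under stated assumptions: `slow_vlm_score` is a placeholder for real VLM inference, and the queue-and-thread layout is illustrative, not the paper's pipeline.

```python
import queue
import threading

# Sketch of asynchronous reward labeling: the environment loop enqueues
# transitions at full speed while a background worker scores them with a
# (stand-in) slow VLM.

def slow_vlm_score(transition):
    # Placeholder for expensive VLM reward inference.
    return 1.0 if transition["safe"] else -1.0

def labeler(pending: queue.Queue, labeled: list):
    while True:
        transition = pending.get()
        if transition is None:  # sentinel: shut down the worker
            break
        transition["semantic_reward"] = slow_vlm_score(transition)
        labeled.append(transition)
        pending.task_done()

pending = queue.Queue()
labeled = []
worker = threading.Thread(target=labeler, args=(pending, labeled), daemon=True)
worker.start()

# The environment loop never blocks on VLM inference.
for step in range(4):
    pending.put({"step": step, "safe": step % 2 == 0})

pending.join()     # wait until every queued transition is labeled
pending.put(None)  # stop the worker
worker.join()
```

The labeled transitions then feed the RL update; because the VLM only ever runs in this offline labeling role, it can be dropped entirely at deployment.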