COVLM-RL: Critical Object-Oriented Reasoning for Autonomous Driving Using VLM-Guided Reinforcement Learning

📅 2025-12-10
📈 Citations: 0
Influential: 0
🤖 AI Summary
End-to-end autonomous driving suffers from poor generalization, low training efficiency, and uninterpretable decision-making. To address these challenges, we propose a novel framework integrating critical-object–guided semantic reasoning with vision-language model (VLM)–guided reinforcement learning (RL). Our method introduces a pioneering critical-object–centric semantic reasoning paradigm; employs chain-of-thought prompting to generate semantic decision priors explicitly injected into RL training; and incorporates a semantic–action consistency loss to enhance policy interpretability and stability. By unifying multi-view perception fusion with VLM-based semantic modeling, our approach achieves a 30% improvement in task success rate and a 50% gain in zero-shot generalization to unseen scenarios within the CARLA simulator—demonstrating substantial gains in robustness and data efficiency.

📝 Abstract
End-to-end autonomous driving frameworks face persistent challenges in generalization, training efficiency, and interpretability. While recent methods leverage Vision-Language Models (VLMs) through supervised learning on large-scale datasets to improve reasoning, they often lack robustness in novel scenarios. Conversely, reinforcement learning (RL)-based approaches enhance adaptability but remain data-inefficient and lack transparent decision-making. To address these limitations, we propose COVLM-RL, a novel end-to-end driving framework that integrates Critical Object-oriented (CO) reasoning with VLM-guided RL. Specifically, we design a Chain-of-Thought (CoT) prompting strategy that enables the VLM to reason over critical traffic elements and generate high-level semantic decisions, effectively transforming multi-view visual inputs into structured semantic decision priors. These priors reduce the input dimensionality and inject task-relevant knowledge into the RL loop, accelerating training and improving policy interpretability. However, bridging high-level semantic guidance with continuous low-level control remains non-trivial. To this end, we introduce a consistency loss that encourages alignment between the VLM's semantic plans and the RL agent's control outputs, enhancing interpretability and training stability. Experiments conducted in the CARLA simulator demonstrate that COVLM-RL significantly improves the success rate by 30% in trained driving environments and by 50% in previously unseen environments, highlighting its strong generalization capability.
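The semantic-action consistency loss described above could be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: the semantic action set, the dead-zone discretization of steering, and the cross-entropy form are all assumptions for the sake of the example.

```python
import math

# Hypothetical semantic vocabulary (illustrative; the paper's actual
# high-level decision set is not specified in this summary):
SEMANTIC_ACTIONS = ["turn_left", "go_straight", "turn_right"]

def discretize_steering(steer, dead_zone=0.1):
    """Map a continuous steering command in [-1, 1] to a semantic bin index.
    The dead-zone threshold is an assumed hyperparameter."""
    if steer < -dead_zone:
        return 0  # turn_left
    if steer > dead_zone:
        return 2  # turn_right
    return 1      # go_straight

def consistency_loss(semantic_probs, vlm_decision, eps=1e-8):
    """Cross-entropy between the policy's distribution over semantic bins
    and the VLM's high-level decision (given as a bin index).
    Low when the control policy agrees with the VLM's plan, high otherwise."""
    return -math.log(semantic_probs[vlm_decision] + eps)
```

In training, a term like this would be added to the RL objective so that gradient updates pull the low-level controller toward the VLM's semantic plan while the task reward still drives the fine-grained control.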
Problem

Research questions and friction points this paper is trying to address.

Improves generalization and interpretability in autonomous driving systems.
Enhances training efficiency and robustness in novel driving scenarios.
Bridges high-level semantic reasoning with low-level control for stability.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Integrates Critical Object reasoning with VLM-guided reinforcement learning
Uses Chain-of-Thought prompting to generate semantic decision priors
Introduces consistency loss to align semantic plans with control outputs
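The CoT prompting contribution above could take a shape like the sketch below. The template wording, step structure, and decision vocabulary are hypothetical; the summary does not reproduce the paper's actual prompts.

```python
# Hypothetical CoT prompt template for critical-object reasoning
# (illustrative only; not the paper's actual prompt).
COT_TEMPLATE = (
    "You are a driving decision assistant.\n"
    "Critical objects detected from multi-view cameras: {objects}.\n"
    "Step 1: Describe each object's position and motion relative to the ego vehicle.\n"
    "Step 2: Identify which objects constrain the ego vehicle's next maneuver.\n"
    "Step 3: Output one high-level decision from "
    "{{turn_left, go_straight, turn_right, stop}}."
)

def build_cot_prompt(critical_objects):
    """Assemble a chain-of-thought prompt from a list of detected
    critical-object descriptions."""
    return COT_TEMPLATE.format(objects=", ".join(critical_objects))
```

The VLM's answer to the final step would then serve as the structured semantic decision prior injected into the RL loop.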
Lin Li
School of Mechanical and Aerospace Engineering, Nanyang Technological University, 639798, Singapore

Yuxin Cai
School of Mechanical and Aerospace Engineering, Nanyang Technological University, 639798, Singapore

Jianwu Fang
Xi'an Jiaotong University
Scene understanding; safe driving perception and planning

Jianru Xue
National Key Laboratory of Human-Machine Hybrid Augmented Intelligence, National Engineering Research Center for Visual Information and Applications, Institute of Artificial Intelligence and Robotics, Xi'an Jiaotong University, Xi'an 710049, China

Chen Lv
School of Mechanical and Aerospace Engineering, Nanyang Technological University, 639798, Singapore