VLDrive: Vision-Augmented Lightweight MLLMs for Efficient Language-grounded Autonomous Driving

📅 2025-11-09
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address two critical bottlenecks in LLM-driven autonomous driving, weak visual representation and model redundancy, this paper proposes a vision-enhanced lightweight multimodal large language model (MLLM). It introduces three mechanisms: (1) cycle-consistent dynamic visual token pruning, (2) memory-enhanced feature aggregation, and (3) distance-decoupled instruction attention, which together enable efficient visual token compression and long-range vision-language joint modeling. Evaluated end-to-end in CARLA under closed-loop settings, the model reduces parameters from 7B to 1.3B (an 81% reduction) while improving driving scores by 15.4%, 16.8%, and 7.6% at tiny, short, and long distances, respectively. These gains demonstrate substantial improvements in perceptual robustness and deployment feasibility for real-world autonomous driving systems.
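To make the token-compression idea concrete, below is a minimal sketch of importance-score-based visual token pruning. It is a hypothetical illustration under stated assumptions, not VLDrive's actual cycle-consistent mechanism: the `DynamicVisualTokenPruner` class, the learned linear scorer, and the `keep_ratio` parameter are all inventions for exposition.

```python
import torch
import torch.nn as nn

class DynamicVisualTokenPruner(nn.Module):
    """Sketch of top-k visual token pruning.

    NOTE: hypothetical illustration; the paper's cycle-consistent
    pruning likely uses a different, cycle-based keep criterion.
    """

    def __init__(self, dim: int, keep_ratio: float = 0.25):
        super().__init__()
        self.score = nn.Linear(dim, 1)  # learned per-token importance score
        self.keep_ratio = keep_ratio

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, num_tokens, dim), e.g. ViT patch embeddings
        scores = self.score(tokens).squeeze(-1)            # (B, N)
        k = max(1, int(tokens.size(1) * self.keep_ratio))
        idx = scores.topk(k, dim=1).indices                # keep top-k tokens
        idx = idx.sort(dim=1).values                       # preserve spatial order
        batch_idx = torch.arange(tokens.size(0)).unsqueeze(-1)
        return tokens[batch_idx, idx]                      # (B, k, dim)

# usage: compress 256 patch tokens down to 64 per image
pruner = DynamicVisualTokenPruner(dim=768, keep_ratio=0.25)
visual_tokens = torch.randn(2, 256, 768)
compact = pruner(visual_tokens)  # shape: (2, 64, 768)
```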

📝 Abstract
Recent advancements in language-grounded autonomous driving have been significantly promoted by the sophisticated cognition and reasoning capabilities of large language models (LLMs). However, current LLM-based approaches encounter critical challenges: (1) Failure analysis reveals that frequent collisions and obstructions, stemming from limitations in visual representations, remain primary obstacles to robust driving performance. (2) The substantial parameters of LLMs pose considerable deployment hurdles. To address these limitations, we introduce VLDrive, a novel approach featuring a lightweight MLLM architecture with enhanced vision components. VLDrive achieves compact visual tokens through innovative strategies, including cycle-consistent dynamic visual pruning and memory-enhanced feature aggregation. Furthermore, we propose a distance-decoupled instruction attention mechanism to improve joint visual-linguistic feature learning, particularly for long-range visual tokens. Extensive experiments conducted in the CARLA simulator demonstrate VLDrive's effectiveness. Notably, VLDrive achieves state-of-the-art driving performance while reducing parameters by 81% (from 7B to 1.3B), yielding substantial driving score improvements of 15.4%, 16.8%, and 7.6% at tiny, short, and long distances, respectively, in closed-loop evaluations. Code is available at https://github.com/ReaFly/VLDrive.
Problem

Research questions and friction points this paper is trying to address.

Addresses collision risks from limited visual representations in autonomous driving
Reduces large parameter overhead of language models for deployment
Enhances joint visual-linguistic feature learning for long-range perception
Innovation

Methods, ideas, or system contributions that make the work stand out.

Lightweight MLLM architecture with enhanced vision components
Cycle-consistent dynamic visual pruning for compact visual tokens
Memory-enhanced feature aggregation for long-range visual context
Distance-decoupled instruction attention for joint visual-linguistic feature learning (see the sketch after this list)
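As a rough illustration of the decoupling idea, the sketch below attends instruction embeddings to near- and far-distance visual token groups separately before fusing the two contexts. Everything here is an assumption: the near/far grouping, the two `nn.MultiheadAttention` branches, and the linear fusion layer are hypothetical stand-ins, not the paper's actual design.

```python
import torch
import torch.nn as nn

class DistanceDecoupledAttention(nn.Module):
    """Sketch of distance-decoupled instruction attention.

    NOTE: hypothetical; assumes visual tokens are pre-grouped by
    estimated distance, which the paper may realize differently.
    """

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.near_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.far_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.fuse = nn.Linear(2 * dim, dim)  # merge near/far contexts

    def forward(self, instr: torch.Tensor, near_tokens: torch.Tensor,
                far_tokens: torch.Tensor) -> torch.Tensor:
        # instr: (B, L, D) instruction embeddings;
        # near_tokens / far_tokens: visual tokens grouped by distance.
        near_ctx, _ = self.near_attn(instr, near_tokens, near_tokens)
        far_ctx, _ = self.far_attn(instr, far_tokens, far_tokens)
        return self.fuse(torch.cat([near_ctx, far_ctx], dim=-1))

# usage: 16 instruction tokens attend to 48 near and 16 far visual tokens
attn = DistanceDecoupledAttention(dim=768)
instr = torch.randn(2, 16, 768)
near, far = torch.randn(2, 48, 768), torch.randn(2, 16, 768)
out = attn(instr, near, far)  # shape: (2, 16, 768)
```

Decoupling the two ranges keeps sparse long-range tokens from being drowned out by dense near-field tokens in a single shared attention pass, which matches the paper's stated goal of improving learning for long-range visual tokens.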
Ruifei Zhang
The Chinese University of Hong Kong, Shenzhen
Computer Vision · Medical Image Analysis · Vision and Language
Wei Zhang
Baidu Inc.
Xiao Tan
Baidu Inc.
Sibei Yang
Associate Professor, School of Computer Science and Engineering, Sun Yat-sen University
Xiang Wan
Shenzhen Research Institute of Big Data
Bioinformatics · Data Mining · Big Data Analysis
Xiaonan Luo
Guilin University of Electronic Technology
Guanbin Li
Sun Yat-sen University; Guangdong Key Laboratory of Big Data Analysis and Processing