SimLingo: Vision-Only Closed-Loop Autonomous Driving with Language-Action Alignment

📅 2025-03-12
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing autonomous driving systems struggle to achieve high driving performance and robust language understanding at the same time, largely because the two are learned in misaligned representation spaces. To address this, the authors propose the first end-to-end framework unifying closed-loop driving control, vision-language understanding, and language-action alignment. Methodologically, the approach builds on a camera-only vision-language model (VLM) with no LiDAR or other expensive sensors; introduces a language-action alignment mechanism, trained with self-supervision in simulation, that keeps driving behavior semantically consistent with linguistic instructions; and jointly optimizes all three tasks. SimLingo achieves state-of-the-art performance on the CARLA Bench2Drive benchmark and is the winning entry of the CARLA Challenge 2024, while also generalizing well across a wide variety of language-related tasks without compromising driving performance. The core contribution is the first realization of three-way representational alignment among the driving policy, vision-language perception, and language-grounded action execution.
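
To make the unified three-task interface concrete, here is a minimal sketch, assuming a PyTorch-style model: a camera-only encoder feeds a shared transformer token stream from which both a text answer and future waypoints are decoded. This is not the authors' implementation; every module, dimension, and head name is an illustrative assumption.

```python
# Minimal sketch (not the authors' code) of a camera-only VLM that decodes
# language and driving actions from one shared representation.
import torch
import torch.nn as nn

class SimLingoSketch(nn.Module):
    def __init__(self, d_model=512, n_waypoints=10, vocab_size=32000):
        super().__init__()
        self.n_waypoints = n_waypoints
        # Camera-only perception: there is deliberately no LiDAR branch.
        self.patchify = nn.Conv2d(3, d_model, kernel_size=16, stride=16)
        self.text_embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=4)
        # Both heads read the same token stream, so the answer a user sees
        # and the trajectory the car follows come from one aligned space.
        self.lm_head = nn.Linear(d_model, vocab_size)           # text answer
        self.action_head = nn.Linear(d_model, n_waypoints * 2)  # (x, y) points

    def forward(self, image, prompt_ids):
        # image: (B, 3, H, W); prompt_ids: (B, L) tokenized instruction/question
        vis = self.patchify(image).flatten(2).transpose(1, 2)   # (B, T, d)
        txt = self.text_embed(prompt_ids)                       # (B, L, d)
        tokens = self.backbone(torch.cat([vis, txt], dim=1))
        lm_logits = self.lm_head(tokens[:, vis.size(1):])       # text slots
        waypoints = self.action_head(tokens[:, -1])             # last token
        return lm_logits, waypoints.view(-1, self.n_waypoints, 2)
```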

📝 Abstract
Integrating large language models (LLMs) into autonomous driving has attracted significant attention with the hope of improving generalization and explainability. However, existing methods often focus on either driving or vision-language understanding, and achieving both high driving performance and extensive language understanding remains challenging. In addition, the dominant approach to vision-language understanding is visual question answering; for autonomous driving, however, this is useful only if it is aligned with the action space, since otherwise the model's answers could be inconsistent with its behavior. Therefore, we propose a model that can handle three different tasks: (1) closed-loop driving, (2) vision-language understanding, and (3) language-action alignment. Our model SimLingo is based on a vision-language model (VLM) and uses only camera input, excluding expensive sensors like LiDAR. SimLingo obtains state-of-the-art performance on the widely used CARLA simulator on the Bench2Drive benchmark and is the winning entry of the CARLA Challenge 2024. Additionally, we achieve strong results on a wide variety of language-related tasks while maintaining high driving performance.
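
The alignment concern the abstract raises (answers that contradict behavior) can be illustrated with a toy consistency check: compare a stated maneuver against the lateral drift of the predicted waypoints. The coordinate convention, threshold, and keyword matching below are assumptions made for illustration, not the paper's evaluation protocol.

```python
# Toy check for language-action consistency: does a stated maneuver agree
# with the lateral direction of the predicted waypoints?
import numpy as np

def answer_matches_action(answer: str, waypoints: np.ndarray, thresh=0.5) -> bool:
    """waypoints: (N, 2) future (x, y) points in the ego frame, +y = left."""
    lateral = waypoints[-1, 1] - waypoints[0, 1]  # net leftward displacement
    if "left" in answer.lower():
        return lateral > thresh
    if "right" in answer.lower():
        return lateral < -thresh
    return abs(lateral) <= thresh                 # otherwise expect ~straight

wps = np.array([[float(i), 0.2 * i] for i in range(10)])               # drifts left
print(answer_matches_action("I will change to the left lane.", wps))  # True
```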
Problem

Research questions and friction points this paper is trying to address.

Integrating LLMs into autonomous driving to improve generalization and explainability.
Achieving high driving performance together with extensive language understanding in one model.
Aligning vision-language understanding (typically visual question answering) with the action space, so that answers are consistent with driving behavior.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Unifies closed-loop driving, vision-language understanding, and language-action alignment in a single VLM-based model
Aligns language with the action space so that answers stay consistent with driving behavior (a sketch of such a joint objective follows this list)
Uses camera input only, excluding expensive sensors such as LiDAR
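
One plausible way to jointly train the three tasks listed above is a weighted multi-task objective: next-token cross-entropy for the language output, L1 regression on expert waypoints for driving, and a second waypoint term for instruction-conditioned ("what-if") trajectories to enforce language-action alignment. The sketch below is a hedged illustration under these assumptions, not the paper's actual loss; the function name, arguments, and weights are made up for exposition.

```python
# A hedged sketch of a joint objective over the three tasks. Tensor shapes:
#   lm_logits: (B, L, V) language-head logits; answer_ids: (B, L) targets
#   pred_wp / expert_wp: (B, N, 2) predicted / expert future waypoints
#   dream_wp / dream_target_wp: waypoints predicted for an alternative
#   instruction, and the trajectory that instruction implies (optional).
import torch
import torch.nn.functional as F

def joint_loss(lm_logits, answer_ids, pred_wp, expert_wp,
               dream_wp=None, dream_target_wp=None,
               w_lang=1.0, w_drive=1.0, w_align=1.0):
    # (1) vision-language understanding: next-token prediction on the answer
    loss_lang = F.cross_entropy(
        lm_logits.reshape(-1, lm_logits.size(-1)), answer_ids.reshape(-1)
    )
    # (2) closed-loop driving: regress the expert's future waypoints
    loss_drive = F.l1_loss(pred_wp, expert_wp)
    # (3) language-action alignment: actions decoded for a hypothetical
    # instruction must match the trajectory that instruction implies
    loss_align = (
        F.l1_loss(dream_wp, dream_target_wp)
        if dream_wp is not None
        else torch.zeros((), device=pred_wp.device)
    )
    return w_lang * loss_lang + w_drive * loss_drive + w_align * loss_align
```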