Latent Chain-of-Thought World Modeling for End-to-End Driving

📅 2025-12-10
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the low inference efficiency and insufficient safety of end-to-end autonomous driving in complex scenarios, this paper proposes Latent Chain-of-Thought (Latent-CoT): a paradigm that abandons natural-language reasoning and instead embeds action proposals and world-model predictions jointly in a unified latent space, coupling action selection with scene evolution. Methodologically, the authors introduce action-aligned latent representations, joint action-world tokenization, and a two-stage training framework: cold-start supervised pretraining on rollouts from real-world trajectories, followed by closed-loop reinforcement learning. On a large-scale end-to-end driving benchmark, the approach improves both inference speed and trajectory quality, and its gains from reinforcement learning substantially exceed those of non-reasoning and text-based chain-of-thought baselines. To the authors' knowledge, this is the first work to empirically validate the effectiveness of latent-space chain-of-thought reasoning for autonomous driving.

📝 Abstract
Recent Vision-Language-Action (VLA) models for autonomous driving explore inference-time reasoning as a way to improve driving performance and safety in challenging scenarios. Most prior work uses natural language to express chain-of-thought (CoT) reasoning before producing driving actions. However, text may not be the most efficient representation for reasoning. In this work, we present Latent-CoT-Drive (LCDrive): a model that expresses CoT in a latent language that captures possible outcomes of the driving actions being considered. Our approach unifies CoT reasoning and decision making by representing both in an action-aligned latent space. Instead of natural language, the model reasons by interleaving (1) action-proposal tokens, which use the same vocabulary as the model's output actions; and (2) world model tokens, which are grounded in a learned latent world model and express future outcomes of these actions. We cold start latent CoT by supervising the model's action proposals and world model tokens based on ground-truth future rollouts of the scene. We then post-train with closed-loop reinforcement learning to strengthen reasoning capabilities. On a large-scale end-to-end driving benchmark, LCDrive achieves faster inference, better trajectory quality, and larger improvements from interactive reinforcement learning compared to both non-reasoning and text-reasoning baselines.
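The interleaved reasoning loop described above can be sketched as a toy decoding procedure. This is a minimal illustration, not the paper's released code: the token vocabularies, step count, and the `propose_action` / `predict_outcome` stand-ins are all hypothetical, replacing the learned policy and latent world model with deterministic placeholders.

```python
import random

# Hypothetical sizes; all names here are illustrative assumptions,
# not taken from the paper's implementation.
NUM_ACTION_TOKENS = 8   # shared vocabulary for proposals and final actions
NUM_WORLD_TOKENS = 16   # codes from a learned latent world model
REASONING_STEPS = 3     # proposal/outcome pairs emitted before committing

def propose_action(context):
    """Stand-in for the policy head emitting an action-proposal token."""
    rng = random.Random(hash(tuple(context)) % (2**32))
    return ("action", rng.randrange(NUM_ACTION_TOKENS))

def predict_outcome(context):
    """Stand-in for the world model emitting a latent outcome token."""
    rng = random.Random(hash(tuple(context)) % (2**32) + 1)
    return ("world", rng.randrange(NUM_WORLD_TOKENS))

def latent_cot_rollout(scene_tokens):
    """Interleave action-proposal tokens with world-model outcome tokens,
    then emit the final action from the same action vocabulary."""
    context = list(scene_tokens)
    for _ in range(REASONING_STEPS):
        context.append(propose_action(context))   # (1) action-proposal token
        context.append(predict_outcome(context))  # (2) world-model token
    final_action = propose_action(context)         # decision shares the vocab
    return context, final_action

trace, action = latent_cot_rollout([("scene", 0), ("scene", 1)])
```

The key structural point the sketch preserves is that the final decision and the intermediate proposals are drawn from the same action vocabulary, so reasoning and acting stay in one action-aligned space rather than detouring through text.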
Problem

Research questions and friction points this paper is trying to address.

Improves driving performance with latent chain-of-thought reasoning
Unifies reasoning and decision-making in action-aligned latent space
Enhances safety and efficiency over text-based reasoning models
Innovation

Methods, ideas, or system contributions that make the work stand out.

Latent language replaces text for reasoning
Action-aligned latent space unifies reasoning and decisions
Supervised cold start then reinforcement learning post-training