LAD-Drive: Bridging Language and Trajectory with Action-Aware Diffusion Transformers

📅 2026-03-02

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the challenge of translating the discrete semantic knowledge of multimodal large language models into continuous autonomous driving trajectories that are kinematically feasible and safe while preserving uncertainty about high-level intent. The authors propose LAD-Drive, a framework that decouples high-level intent from low-level planning. An action decoder infers a probabilistic meta-action distribution that explicitly models navigational uncertainty, avoiding the information loss inherent in conventional one-hot encodings. An action-aware diffusion transformer, conditioned on this distribution and the vehicle's kinematic state, then uses a truncated denoising process to refine learned motion anchors into trajectories. Evaluated on the LangAuto benchmark, LAD-Drive achieves state-of-the-art performance, improving Driving Score by up to 59% over competitive baselines while significantly reducing route deviations and collisions.
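To make the contrast between a probabilistic meta-action distribution and a one-hot encoding concrete, here is a minimal sketch. The action vocabulary, the logits, and all function names are illustrative assumptions and are not taken from the paper; the point is only that collapsing a belief to its arg-max discards the uncertainty that the full distribution retains.

```python
import math

# Hypothetical meta-action vocabulary (assumption; the paper's action set is not listed here).
META_ACTIONS = ["keep_lane", "turn_left", "turn_right", "stop"]

def softmax(logits):
    """Turn raw scores into a probability distribution (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def one_hot(probs):
    """Collapse a distribution to its most likely action."""
    best = max(range(len(probs)), key=probs.__getitem__)
    return [1.0 if i == best else 0.0 for i in range(len(probs))]

def entropy(probs):
    """Shannon entropy in nats; zero means no uncertainty is represented."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

# Made-up logits for an ambiguous scene: "keep_lane" and "turn_left" are nearly tied.
logits = [2.0, 1.8, -1.0, -2.0]
belief = softmax(logits)   # soft belief state over meta-actions
hard = one_hot(belief)     # what a one-hot conditioning signal would keep

print(dict(zip(META_ACTIONS, (round(p, 3) for p in belief))))
print("entropy of belief:", round(entropy(belief), 3))
print("entropy of one-hot:", entropy(hard))
```

In this toy case the one-hot vector reports "keep_lane" with full certainty, while the belief vector still shows that "turn_left" is almost as likely, which is the kind of information a downstream planner could condition on.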

📝 Abstract

While multimodal large language models (MLLMs) provide advanced reasoning for autonomous driving, translating their discrete semantic knowledge into continuous trajectories remains a fundamental challenge. Existing methods often rely on unimodal planning heads that inherently limit their ability to represent multimodal driving behavior. Furthermore, most generative approaches condition on one-hot encoded actions, discarding the nuanced navigational uncertainty critical for complex scenarios. To resolve these limitations, we introduce LAD-Drive, a generative framework that structurally disentangles high-level intention from low-level spatial planning. LAD-Drive employs an action decoder to infer a probabilistic meta-action distribution, establishing an explicit belief state that preserves the nuanced intent typically lost by one-hot encodings. This distribution, fused with the vehicle's kinematic state, conditions an action-aware diffusion decoder that utilizes a truncated denoising process to refine learned motion anchors into safe, kinematically feasible trajectories. Extensive evaluations on the LangAuto benchmark demonstrate that LAD-Drive achieves state-of-the-art results, outperforming competitive baselines by up to 59% in Driving Score while significantly reducing route deviations and collisions. We will publicly release the code and models at https://github.com/iis-esslingen/lad-drive.
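The idea of truncated denoising over motion anchors can be illustrated with a toy sketch. Instead of starting from pure noise and running a long reverse chain, the process starts from a lightly perturbed anchor trajectory and applies only a few refinement steps. Everything below is an assumption made for illustration: the function names, the step count, the noise scale, and especially `toy_denoiser`, which merely pulls waypoints toward a fixed reference path and stands in for the learned, condition-aware network described in the abstract.

```python
import random

def truncated_denoise(anchor, denoiser, num_steps=2, noise_scale=0.1, seed=0):
    """Perturb an anchor trajectory slightly, then run a short reverse chain.

    anchor:   list of (x, y) waypoints (a hypothetical motion anchor)
    denoiser: callable (trajectory, step) -> trajectory; a stand-in for a learned model
    """
    rng = random.Random(seed)
    # Truncation: begin near the anchor rather than from pure Gaussian noise.
    traj = [(x + rng.gauss(0.0, noise_scale), y + rng.gauss(0.0, noise_scale))
            for x, y in anchor]
    # Only a handful of denoising steps instead of a full-length schedule.
    for step in reversed(range(num_steps)):
        traj = denoiser(traj, step)
    return traj

# Toy setup: a straight-ahead anchor and a gentle left curve as the reference path.
anchor = [(float(i), 0.0) for i in range(5)]
target = [(float(i), 0.1 * i * i) for i in range(5)]

def toy_denoiser(traj, step):
    """Illustrative stand-in: move each waypoint halfway toward the reference path."""
    return [(x + 0.5 * (tx - x), y + 0.5 * (ty - y))
            for (x, y), (tx, ty) in zip(traj, target)]

def mean_error(traj):
    """Mean Euclidean distance between a trajectory and the reference path."""
    return sum(((x - tx) ** 2 + (y - ty) ** 2) ** 0.5
               for (x, y), (tx, ty) in zip(traj, target)) / len(traj)

refined = truncated_denoise(anchor, toy_denoiser)
print("anchor error: ", round(mean_error(anchor), 3))
print("refined error:", round(mean_error(refined), 3))
```

In a real system the denoiser would also receive the conditioning signal (here, the meta-action distribution fused with the kinematic state); that input is omitted to keep the sketch short.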

Problem

Research questions and friction points this paper is trying to address.

autonomous driving

trajectory generation

multimodal behavior

intent representation

navigational uncertainty

Innovation

Methods, ideas, or system contributions that make the work stand out.

Action-Aware Diffusion

Probabilistic Meta-Action

Trajectory Generation