Unifying Language-Action Understanding and Generation for Autonomous Driving

📅 2026-03-01
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses two key limitations of existing vision-language-action (VLA) models in autonomous driving: insufficient alignment between language instructions and actions, and inefficient autoregressive action generation. To overcome these challenges, the authors propose LinkVLA, a novel architecture that unifies language and action representations through a shared discrete codebook. The framework incorporates an auxiliary action-understanding task to establish bidirectional semantic mapping between modalities and employs a coarse-to-fine (C2F) two-stage generation strategy to enhance inference efficiency. This approach achieves, for the first time, simultaneous structural and semantic alignment between language and actions, integrating bidirectional modeling with an efficient generation paradigm. Evaluated on closed-loop driving benchmarks, LinkVLA significantly improves both instruction-following accuracy and overall driving performance while reducing inference time by 86%.

📝 Abstract
Vision-Language-Action (VLA) models are emerging as a promising paradigm for end-to-end autonomous driving, valued for their potential to leverage world knowledge and reason about complex driving scenes. However, existing methods suffer from two critical limitations: a persistent misalignment between language instructions and action outputs, and the inherent inefficiency of typical auto-regressive action generation. In this paper, we introduce LinkVLA, a novel architecture that directly addresses these challenges to enhance both alignment and efficiency. First, we establish a structural link by unifying language and action tokens into a shared discrete codebook, processed within a single multi-modal model. This structurally enforces cross-modal consistency from the ground up. Second, to create a deep semantic link, we introduce an auxiliary action-understanding objective that trains the model to generate descriptive captions from trajectories, fostering a bidirectional language-action mapping. Finally, we replace slow, step-by-step decoding with a two-stage coarse-to-fine (C2F) generation method that efficiently decodes the action sequence, reducing inference time by 86%. Experiments on closed-loop driving benchmarks show consistent gains in instruction-following accuracy and driving performance, alongside reduced inference latency.
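The abstract does not detail how action tokens enter the shared codebook, but the general recipe behind such unification is well known: quantize continuous trajectory waypoints into discrete bins and offset the bin ids past the language vocabulary, so text and action live in one token space. The sketch below illustrates that idea only; the vocabulary size, bin count, and coordinate range are assumptions, not LinkVLA's actual values.

```python
import numpy as np

# Illustrative sketch (not LinkVLA's implementation): discretize continuous
# (x, y) trajectory waypoints into token ids drawn from a codebook that
# extends the language vocabulary, so language and action share one space.

LANG_VOCAB_SIZE = 32000      # assumed size of the base language vocabulary
NUM_ACTION_BINS = 256        # assumed number of discrete bins per coordinate
COORD_RANGE = (-50.0, 50.0)  # assumed ego-frame coordinate range in meters

def action_to_tokens(waypoints: np.ndarray) -> np.ndarray:
    """Map an (N, 2) array of (x, y) waypoints to shared-codebook token ids."""
    lo, hi = COORD_RANGE
    norm = np.clip((waypoints - lo) / (hi - lo), 0.0, 1.0 - 1e-9)
    bins = (norm * NUM_ACTION_BINS).astype(np.int64)  # 0..NUM_ACTION_BINS-1
    # Offset past the language vocabulary so ids never collide with text tokens.
    return (bins + LANG_VOCAB_SIZE).ravel()

def tokens_to_action(tokens: np.ndarray) -> np.ndarray:
    """Invert the mapping back to (approximate) continuous waypoints."""
    lo, hi = COORD_RANGE
    bins = tokens.reshape(-1, 2) - LANG_VOCAB_SIZE
    centers = (bins + 0.5) / NUM_ACTION_BINS          # bin centers in [0, 1]
    return lo + centers * (hi - lo)

traj = np.array([[1.2, 0.1], [2.5, 0.3], [4.1, 0.8]])
ids = action_to_tokens(traj)   # six ids, all >= LANG_VOCAB_SIZE
recon = tokens_to_action(ids)  # recovers traj up to half a bin width
```

The round trip is lossy only up to half a bin width (here about 0.2 m), which is the usual trade-off when folding continuous actions into a discrete codebook.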
Problem

Research questions and friction points this paper is trying to address.

Vision-Language-Action
action generation
language-action alignment
autonomous driving
instruction following
Innovation

Methods, ideas, or system contributions that make the work stand out.

Vision-Language-Action
discrete codebook
bidirectional mapping
coarse-to-fine generation
autonomous driving
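The latency benefit of the coarse-to-fine idea listed above can be illustrated with a toy cost model: autoregressive decoding spends one forward pass per token, while a two-step C2F decode spends one pass on a coarse action sketch and one on the full refinement. The decoder below is a counting stand-in, not LinkVLA's network, and the 4x coarse downsampling factor is an assumption for illustration.

```python
# Toy latency model (illustrative only): count network forward passes for
# autoregressive decoding versus a two-step coarse-to-fine (C2F) decode.

class CountingDecoder:
    """Stand-in for an action decoder that just counts forward passes."""

    def __init__(self) -> None:
        self.forward_passes = 0

    def forward(self, tokens_out: int) -> list:
        # One network forward pass emitting `tokens_out` tokens in parallel.
        self.forward_passes += 1
        return [0] * tokens_out

def autoregressive_decode(model: CountingDecoder, seq_len: int) -> int:
    for _ in range(seq_len):     # one forward pass per generated token
        model.forward(1)
    return model.forward_passes

def c2f_decode(model: CountingDecoder, seq_len: int) -> int:
    model.forward(seq_len // 4)  # step 1: coarse tokens, emitted in parallel
    model.forward(seq_len)       # step 2: refine to the full action sequence
    return model.forward_passes

SEQ_LEN = 24
ar_passes = autoregressive_decode(CountingDecoder(), SEQ_LEN)   # 24 passes
c2f_passes = c2f_decode(CountingDecoder(), SEQ_LEN)             # 2 passes
```

Under this toy model the pass count drops from seq_len to 2 regardless of sequence length, which is the qualitative mechanism behind the reported 86% inference-time reduction (the exact figure depends on the real architecture, not on this sketch).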
🔎 Similar Papers
2024-08-19 · IEEE Workshop/Winter Conference on Applications of Computer Vision · Citations: 30
👥 Authors
Xinyang Wang (State Key Lab of CAD&CG, Zhejiang University; Li Auto)
Qian Liu (Li Auto)
Wenjie Ding (Li Auto)
Zhao Yang (Li Auto)
Wei Li (Li Auto)
Chang Liu (Li Auto)
Bailin Li (Li Auto)
Kun Zhan (Li Auto)
Xianpeng Lang (Li Auto)
Wei Chen (State Key Lab of CAD&CG, Zhejiang University)