AutoMoT: A Unified Vision-Language-Action Model with Asynchronous Mixture-of-Transformers for End-to-End Autonomous Driving

📅 2026-03-16
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses key challenges in integrating vision-language models (VLMs) into end-to-end autonomous driving systems, including misalignment between reasoning and action spaces, underutilization of general-purpose reasoning capabilities, and high inference latency. To overcome these issues, the authors propose a unified vision-language-action model featuring an asynchronous hybrid Transformer architecture. This design leverages a joint attention mechanism and an asynchronous execution strategy for fast and slow tasks, preserving the pretrained VLM’s general semantic capabilities while significantly reducing action generation latency. By combining semantic prompting with targeted fine-tuning, the model achieves state-of-the-art performance across multiple open- and closed-loop benchmarks. The results demonstrate that semantic prompting alone suffices for multi-task scene understanding, whereas effective action control still requires task-specific fine-tuning.
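The asynchronous fast-slow execution described above can be illustrated with a minimal sketch (this is not the authors' code; `FastSlowRunner`, the rates, and the stand-in functions are all assumptions for illustration): a slow "reasoning" loop periodically refreshes a shared semantic context, while a fast "action" loop reuses the latest context without blocking on reasoning.

```python
import threading
import time

class FastSlowRunner:
    """Toy fast-slow scheduler: slow reasoning and fast action at different rates."""

    def __init__(self, slow_hz=2.0, fast_hz=10.0):
        self.slow_period = 1.0 / slow_hz
        self.fast_period = 1.0 / fast_hz
        self.context = None            # latest reasoning output (shared state)
        self.lock = threading.Lock()
        self.actions = []

    def slow_reasoning(self, step):
        # stand-in for slow VLM scene reasoning
        return f"scene_summary_{step}"

    def fast_action(self, context, step):
        # stand-in for the fast action head; tolerates a missing context
        return (context or "no_context", step)

    def run(self, duration=0.5):
        stop = time.monotonic() + duration

        def slow_loop():
            step = 0
            while time.monotonic() < stop:
                summary = self.slow_reasoning(step)
                with self.lock:
                    self.context = summary   # publish latest context
                step += 1
                time.sleep(self.slow_period)

        def fast_loop():
            step = 0
            while time.monotonic() < stop:
                with self.lock:
                    ctx = self.context       # read most recent context
                self.actions.append(self.fast_action(ctx, step))
                step += 1
                time.sleep(self.fast_period)

        t_slow = threading.Thread(target=slow_loop)
        t_fast = threading.Thread(target=fast_loop)
        t_slow.start(); t_fast.start()
        t_slow.join(); t_fast.join()
        return self.actions

actions = FastSlowRunner().run(duration=0.5)
print(len(actions))
```

The design point this sketches is that action generation never waits on reasoning: the fast loop always consumes the most recent published context, which is how latency stays bounded by the fast task's period.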

📝 Abstract
Integrating vision-language models (VLMs) into end-to-end (E2E) autonomous driving (AD) systems has shown promise in improving scene understanding. However, existing integration strategies suffer from several limitations: they either struggle to resolve distribution misalignment between reasoning and action spaces, underexploit the general reasoning capabilities of pretrained VLMs, or incur substantial inference latency during action policy generation, which degrades driving performance. To address these challenges, we propose AutoMoT, an end-to-end AD framework that unifies reasoning and action generation within a single vision-language-action (VLA) model. Our approach leverages a mixture-of-transformers (MoT) architecture with joint attention sharing, which preserves the general reasoning capabilities of pretrained VLMs while enabling efficient fast-slow inference through asynchronous execution at different task frequencies. Extensive experiments on multiple benchmarks, under both open- and closed-loop settings, demonstrate that AutoMoT achieves competitive performance compared to state-of-the-art methods. We further investigate the functional boundary of pretrained VLMs in AD, examining when AD-tailored fine-tuning is necessary. Our results show that pretrained VLMs can achieve competitive multi-task scene understanding performance through semantic prompting alone, while fine-tuning remains essential for action-level tasks such as decision-making and trajectory planning. See the Project Page (https://automot-website.github.io/) for demonstration videos and qualitative results.
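The "joint attention sharing" idea in the abstract can be sketched as follows (an illustrative assumption, not the paper's implementation): each token stream keeps its own expert projection weights, but attention is computed jointly over the concatenated sequence, so action tokens can attend to reasoning tokens and vice versa.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def joint_attention(reason_tokens, action_tokens, w_reason, w_action):
    """Per-expert Q/K/V projections, one shared attention over both streams."""
    def project(tokens, w):
        return tokens @ w["q"], tokens @ w["k"], tokens @ w["v"]

    # each expert projects its own tokens with its own weights
    q_r, k_r, v_r = project(reason_tokens, w_reason)
    q_a, k_a, v_a = project(action_tokens, w_action)

    # joint attention: concatenate queries/keys/values across streams
    q = np.concatenate([q_r, q_a], axis=0)
    k = np.concatenate([k_r, k_a], axis=0)
    v = np.concatenate([v_r, v_a], axis=0)

    d = q.shape[-1]
    attn = softmax(q @ k.T / np.sqrt(d))  # every token attends to every token
    return attn @ v

rng = np.random.default_rng(0)
d = 8
make_w = lambda: {k: rng.standard_normal((d, d)) * 0.1 for k in ("q", "k", "v")}
out = joint_attention(rng.standard_normal((4, d)),   # 4 reasoning tokens
                      rng.standard_normal((2, d)),   # 2 action tokens
                      make_w(), make_w())
print(out.shape)  # (6, 8): all six tokens updated through shared attention
```

Keeping the experts' weights separate while sharing the attention map is what lets one stream stay close to the pretrained VLM while the other specializes for actions.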
Problem

Research questions and friction points this paper is trying to address.

vision-language models
end-to-end autonomous driving
distribution misalignment
inference latency
action policy generation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Mixture-of-Transformers
Vision-Language-Action
Asynchronous Inference
End-to-End Autonomous Driving
Pretrained VLMs
Wenhui Huang
Harvard University, US
Songyan Zhang
Nanyang Technological University, Singapore
Computer Vision · Autonomous Driving
Qihang Huang
Nanyang Technological University, Singapore
Zhidong Wang
Nanyang Technological University, Singapore
Zhiqi Mao
Nanyang Technological University, Singapore
Collister Chua
Nanyang Technological University, Singapore
Zhan Chen
Georgia Southern University
Mathematical modeling in biology and scientific computing
Long Chen
Xiaomi EV, China
Chen Lv
Nanyang Technological University, Singapore