🤖 AI Summary
This work addresses key challenges in integrating vision-language models (VLMs) into end-to-end autonomous driving systems: misalignment between reasoning and action spaces, underutilization of general-purpose reasoning capabilities, and high inference latency. To overcome these issues, the authors propose a unified vision-language-action (VLA) model built on a mixture-of-transformers architecture with joint attention sharing. This design preserves the pretrained VLM's general semantic capabilities, while an asynchronous execution strategy runs the slow reasoning task and the fast action-generation task at different frequencies, significantly reducing action-generation latency. The model achieves competitive performance against state-of-the-art methods across multiple open- and closed-loop benchmarks. The results further show that semantic prompting alone suffices for multi-task scene understanding, whereas effective action control still requires task-specific fine-tuning.
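The asynchronous fast-slow execution described above can be sketched as a simple scheduler: the fast action branch runs every tick and consumes the most recent (possibly stale) reasoning context instead of blocking on the slow VLM pass. The concrete frequencies below (2 Hz reasoning, 10 Hz action) are illustrative assumptions, not values from the paper:

```python
# Minimal sketch of asynchronous fast-slow execution at different task
# frequencies. The 2 Hz / 10 Hz split is a hypothetical choice for
# illustration; the paper only states that the two branches run at
# different frequencies.
SLOW_HZ = 2   # slow branch: VLM reasoning
FAST_HZ = 10  # fast branch: action generation

def run_async_fast_slow(duration_s=1.0):
    """Simulate `duration_s` seconds of driving. The fast action branch fires
    every tick, reusing the latest cached reasoning context; the slow branch
    refreshes that context only every FAST_HZ // SLOW_HZ ticks."""
    events = []
    latest_context = None
    n_fast = int(duration_s * FAST_HZ)
    ratio = FAST_HZ // SLOW_HZ  # slow branch fires every `ratio` fast ticks
    for tick in range(n_fast):
        t = tick / FAST_HZ
        if tick % ratio == 0:
            latest_context = f"reasoning@{t:.1f}s"  # slow VLM pass (stub)
            events.append(("slow", t))
        # Fast action step consumes the cached context without waiting.
        events.append(("fast", t, latest_context))
    return events
```

Over one simulated second this produces 10 fast action steps but only 2 slow reasoning passes, which is the source of the latency savings: action generation never waits on the heavyweight VLM.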
📝 Abstract
Integrating vision-language models (VLMs) into end-to-end (E2E) autonomous driving (AD) systems has shown promise in improving scene understanding. However, existing integration strategies suffer from several limitations: they struggle to resolve distribution misalignment between reasoning and action spaces, underexploit the general reasoning capabilities of pretrained VLMs, or incur substantial inference latency during action policy generation, which degrades driving performance. To address these challenges, we propose \OURS, an end-to-end AD framework that unifies reasoning and action generation within a single vision-language-action (VLA) model. Our approach leverages a mixture-of-transformers (MoT) architecture with joint attention sharing, which preserves the general reasoning capabilities of pretrained VLMs while enabling efficient fast-slow inference through asynchronous execution at different task frequencies. Extensive experiments on multiple benchmarks, under both open- and closed-loop settings, demonstrate that \OURS achieves competitive performance compared to state-of-the-art methods. We further investigate the functional boundary of pretrained VLMs in AD, examining when AD-tailored fine-tuning is necessary. Our results show that pretrained VLMs can achieve competitive multi-task scene understanding performance through semantic prompting alone, while fine-tuning remains essential for action-level tasks such as decision-making and trajectory planning. We refer readers to the \href{https://automot-website.github.io/}{Project Page} for demonstration videos and qualitative results.
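The core idea of joint attention sharing in a mixture-of-transformers can be sketched as follows: each token stream (e.g. vision-language "reasoning" tokens and "action" tokens) keeps its own projection weights (its expert), but attention is computed once over the concatenated sequence, so the streams attend to each other. This is a minimal single-head numpy sketch under our own assumptions about the block layout, not the paper's actual implementation:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def make_expert(d, rng):
    # One set of per-modality ("expert") weights: Q/K/V/output projections.
    return {k: rng.standard_normal((d, d)) / np.sqrt(d)
            for k in ("Wq", "Wk", "Wv", "Wo")}

def joint_attention_mot(streams, experts):
    """streams: list of (L_i, d) token arrays, one per modality.
    experts: matching list of per-modality weight dicts.

    Each stream is projected with its own expert weights, but attention is
    computed over the concatenated sequence, so action tokens can attend to
    reasoning tokens and vice versa (joint attention sharing)."""
    d = streams[0].shape[-1]
    qs = [s @ e["Wq"] for s, e in zip(streams, experts)]
    ks = [s @ e["Wk"] for s, e in zip(streams, experts)]
    vs = [s @ e["Wv"] for s, e in zip(streams, experts)]
    q = np.concatenate(qs); k = np.concatenate(ks); v = np.concatenate(vs)
    attn = softmax(q @ k.T / np.sqrt(d)) @ v  # joint attention, single head
    # Split the joint output back per stream; expert-specific output projection.
    outs, offset = [], 0
    for s, e in zip(streams, experts):
        o = attn[offset:offset + len(s)] @ e["Wo"]
        offset += len(s)
        outs.append(s + o)  # residual connection
    return outs
```

Because the only shared computation is the attention itself, the action expert can be trained (or run) without disturbing the pretrained VLM expert's weights, which is the property that lets the framework preserve general reasoning while specializing action generation.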