SAMoE-VLA: A Scene Adaptive Mixture-of-Experts Vision-Language-Action Model for Autonomous Driving

📅 2026-03-09
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the instability and safety risks in existing token-based Mixture-of-Experts (MoE) mechanisms for autonomous driving vision-language-action (VLA) models, which stem from a disconnect between expert selection and scene-level decision-making. To resolve this, the authors propose a scene-adaptive MoE architecture that, for the first time, leverages bird’s-eye-view (BEV) features as structured routing signals instead of conventional token-level routing. Additionally, they introduce a conditional cross-modal causal attention mechanism to enable unified temporal reasoning across perception, language, action, and world knowledge. The proposed method achieves state-of-the-art performance on both the nuScenes open-loop planning benchmark and the LangAuto closed-loop simulation benchmark, outperforming current VLA and world models with fewer parameters.
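As a rough illustration of the scene-level routing idea, the sketch below conditions a mixture-of-experts layer on a pooled BEV scene descriptor rather than on individual tokens. Module names, dimensions, and the soft expert-merging scheme are assumptions for exposition, not the released SAMoE-VLA implementation.

```python
# Minimal sketch of scene-adaptive MoE routing conditioned on BEV features
# (illustrative only; names, dimensions, and soft merging are assumptions).
import torch
import torch.nn as nn
import torch.nn.functional as F


class SceneAdaptiveMoE(nn.Module):
    def __init__(self, d_model: int, bev_dim: int, num_experts: int = 4):
        super().__init__()
        # One feed-forward expert per (hypothetical) driving-scenario type.
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                          nn.Linear(4 * d_model, d_model))
            for _ in range(num_experts)
        )
        # Router consumes a pooled BEV scene descriptor, not per-token embeddings.
        self.router = nn.Linear(bev_dim, num_experts)

    def forward(self, tokens: torch.Tensor, bev: torch.Tensor) -> torch.Tensor:
        # tokens: (B, T, d_model); bev: (B, C, H, W) structured scene features.
        scene = bev.flatten(2).mean(dim=-1)           # (B, C) global scene descriptor
        weights = F.softmax(self.router(scene), -1)   # (B, E) scene-level expert weights
        # Soft-merge expert outputs with the same weights for every token,
        # so expert selection follows the scene rather than individual tokens.
        out = torch.stack([e(tokens) for e in self.experts], dim=-1)  # (B, T, d, E)
        return (out * weights[:, None, None, :]).sum(-1)
```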

📝 Abstract
Recent advances in Vision-Language-Action (VLA) models have shown promising capabilities in autonomous driving by leveraging the understanding and reasoning strengths of Large Language Models (LLMs). However, our empirical analysis reveals that directly applying existing token-level MoE mechanisms, which are inherited from LLM architectures, to VLA models results in unstable performance and safety degradation in autonomous driving, highlighting a misalignment between token-based expert specialization and scene-level decision-making. To address this, we propose SAMoE-VLA, a scene-adaptive Vision-Language-Action framework that conditions expert selection on structured scene representations instead of token embeddings. Our key idea is to derive the MoE routing signal from bird's-eye-view (BEV) features that encapsulate traffic scene context, enabling scenario-dependent expert weighting and merging tailored to distinct driving conditions. Furthermore, to support temporally consistent reasoning across world knowledge, perception, language, and action, we introduce a Conditional Cross-Modal Causal Attention mechanism that integrates world state, linguistic intent, and action history into a unified causal reasoning process. Extensive experiments on the nuScenes open-loop planning dataset and the LangAuto closed-loop benchmark demonstrate that SAMoE-VLA achieves state-of-the-art performance, outperforming prior VLA-based and world-model-based approaches with fewer parameters. Our code will be released soon.
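The unified causal reasoning over world state, perception, linguistic intent, and action history suggests a block-causal attention pattern across modality segments. The sketch below builds such a mask; the segment ordering and the conditioning rules are assumptions for illustration, not details taken from the paper.

```python
# Sketch of a cross-modal causal attention mask over concatenated world-state,
# perception, language, and action-history tokens (the ordering and rules here
# are assumptions about how such conditioning could be structured).
import torch


def cross_modal_causal_mask(lengths: dict[str, int]) -> torch.Tensor:
    """Boolean attention mask (True = may attend) for modality segments ordered
    world -> perception -> language -> action, where each segment attends to
    itself and to all preceding segments (causal across modalities)."""
    order = ["world", "perception", "language", "action"]
    sizes = [lengths[name] for name in order]
    total = sum(sizes)
    mask = torch.zeros(total, total, dtype=torch.bool)
    start = 0
    for size in sizes:
        end = start + size
        mask[start:end, :end] = True  # this segment plus everything before it
        start = end
    return mask


# Example: 16 world tokens, 64 BEV/perception tokens, 12 language tokens, 8 action tokens.
mask = cross_modal_causal_mask({"world": 16, "perception": 64, "language": 12, "action": 8})
```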
Problem

Research questions and friction points this paper is trying to address.

Vision-Language-Action
Mixture-of-Experts
Autonomous Driving
Scene-level Decision-making
Token-level MoE
Innovation

Methods, ideas, or system contributions that make the work stand out.

Scene-Adaptive MoE
Vision-Language-Action
Bird's-Eye-View Representation
Conditional Cross-Modal Causal Attention
Autonomous Driving
Zihan You
Institute for AI Industry Research (AIR), Tsinghua University; School of Instrument Science and Engineering, Southeast University
Hongwei Liu
Institute for AI Industry Research (AIR), Tsinghua University; Zhili College, Tsinghua University
Chenxu Dang
Huazhong University of Science and Technology
Computer Vision, Autonomous Driving
Zhe Wang
Tsinghua University
Computer Vision, Autonomous Driving
Sining Ang
Institute for AI Industry Research (AIR), Tsinghua University; Department of Automation, University of Science and Technology of China
Aoqi Wang
Institute for AI Industry Research (AIR), Tsinghua University; Department of Automation, University of Science and Technology Beijing
Yan Wang
Tsinghua University; SenseTime
Neural Compression, Computer Vision, Machine Learning