CoWorld-VLA: Thinking in a Multi-Expert World Model for Autonomous Driving

📅 2026-05-11
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing vision-language-action (VLA) models for autonomous driving lack explicit, planning-oriented intermediate representations: textual chain-of-thought does not preserve continuous spatiotemporal structure, and latent world reasoning is hard to use as a direct condition for action generation. This work proposes CoWorld-VLA, a multi-expert world reasoning framework that introduces four types of expert tokens (semantic interaction, geometric structure, dynamic evolution, and ego trajectory) to extract complementary world knowledge through multi-source supervision and encode it as planner-accessible conditions. A diffusion-based hierarchical multi-expert fusion planner, coupled with scene context throughout the joint denoising process, then generates continuous ego trajectories. On the NAVSIM v1 benchmark, CoWorld-VLA achieves competitive results in future scene generation and planning, with strong collision avoidance and trajectory accuracy; ablation studies support the complementarity of the expert tokens and their effectiveness as planning conditions.
📝 Abstract
Vision-Language-Action (VLA) models have emerged as a promising paradigm for end-to-end autonomous driving. However, existing reasoning mechanisms still struggle to provide planning-oriented intermediate representations: textual Chain-of-Thought (CoT) fails to preserve continuous spatiotemporal structure, while latent world reasoning remains difficult to use as a direct condition for action generation. In this paper, we propose CoWorld-VLA, a multi-expert world reasoning framework for autonomous driving, where world representations serve as explicit conditions to guide action planning. CoWorld-VLA extracts complementary world information through multi-source supervision and encodes it into expert tokens within the VLA, thereby providing planner-accessible conditioning signals. Specifically, we construct four types of tokens: semantic interaction, geometric structure, dynamic evolution, and ego trajectory tokens, which respectively model interaction intent, spatial structure, future temporal dynamics, and behavioral goals. During action generation, CoWorld-VLA employs a diffusion-based hierarchical multi-expert fusion planner, which is coupled with scene context throughout the joint denoising process to generate continuous ego trajectories. Experiments show that CoWorld-VLA achieves competitive results in both future scene generation and planning on the NAVSIM v1 benchmark, demonstrating strong performance in collision avoidance and trajectory accuracy. Ablation studies further validate the complementarity of expert tokens and their effectiveness as planning conditions for action generation. Code will be available at https://github.com/potatochip1211/CoWorld-VLA.
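The conditioning idea in the abstract (four expert tokens fused into a planning condition that steers a denoising loop) can be illustrated with a toy sketch. This is not the authors' implementation: the `ExpertTokens` container, the two-level averaging in `fuse`, the hand-written `toy_denoiser`, and the assumption that the first two condition dimensions encode an (x, y) goal are all hypothetical stand-ins for learned components.

```python
import random
from dataclasses import dataclass
from typing import List, Tuple

Vec = List[float]
Traj = List[Tuple[float, float]]


@dataclass
class ExpertTokens:
    """Hypothetical container for the four token types named in the abstract."""
    semantic: Vec   # interaction intent
    geometric: Vec  # spatial structure
    dynamic: Vec    # future temporal dynamics
    ego: Vec        # behavioral goal


def mean(vectors: List[Vec]) -> Vec:
    """Element-wise mean of equally sized vectors."""
    return [sum(vals) / len(vals) for vals in zip(*vectors)]


def fuse(tokens: ExpertTokens) -> Vec:
    """Toy two-level fusion (assumption): pool scene experts, then blend with ego."""
    scene = mean([tokens.semantic, tokens.geometric, tokens.dynamic])
    return mean([scene, tokens.ego])


def toy_denoiser(traj: Traj, cond: Vec) -> Traj:
    """Stand-in for a learned network: estimates the noise on each waypoint.

    Assumes cond[0], cond[1] encode a goal point, and that the clean
    trajectory is a straight line from the origin to that goal.
    """
    horizon = len(traj)
    noise = []
    for i, (x, y) in enumerate(traj):
        frac = (i + 1) / horizon
        noise.append((x - cond[0] * frac, y - cond[1] * frac))
    return noise


def plan(tokens: ExpertTokens, horizon: int = 8, steps: int = 20, seed: int = 0) -> Traj:
    """Start from Gaussian noise and iteratively remove the estimated noise."""
    rng = random.Random(seed)
    cond = fuse(tokens)
    traj = [(rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)) for _ in range(horizon)]
    for _ in range(steps):
        eps = toy_denoiser(traj, cond)
        # Remove half of the estimated noise per step (arbitrary toy schedule).
        traj = [(x - 0.5 * ex, y - 0.5 * ey) for (x, y), (ex, ey) in zip(traj, eps)]
    return traj


tokens = ExpertTokens(
    semantic=[2.0, 0.0], geometric=[4.0, 0.0], dynamic=[6.0, 0.0], ego=[10.0, 2.0]
)
trajectory = plan(tokens)
```

In this toy setup the final waypoint converges toward the fused goal, which shows how the condition, rather than the initial noise, determines the planned trajectory. A real planner would replace `toy_denoiser` with a trained network and a proper noise schedule.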
Problem

Research questions and friction points this paper is trying to address.

Vision-Language-Action
world model
autonomous driving
action planning
intermediate representation
Innovation

Methods, ideas, or system contributions that make the work stand out.

multi-expert world model
expert tokens
diffusion-based planning
vision-language-action (VLA)
structured world representation
Minqing Huang
Afari Intelligent Drive
Yujiao Xiang
Afari Intelligent Drive; University of Electronic Science and Technology of China
Zihan Liang
Afari Intelligent Drive; Shanghai Jiao Tong University
Jiajie Huang
Afari Intelligent Drive; Beijing University of Posts and Telecommunications
Jingqi Wang
Afari Intelligent Drive
Zhi Xu
Afari Intelligent Drive
Feiyang Tan
Afari Intelligent Drive
Hangning Zhou
Afari Intelligent Drive
Mu Yang
Afari Intelligent Drive
Gong Chen
Nanjing University
Magnetic imaging