CoWorld-VLA: Thinking in a Multi-Expert World Model for Autonomous Driving

📅 2026-05-11

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing vision-language-action (VLA) models for autonomous driving lack explicit, planning-oriented intermediate representations: textual chain-of-thought does not preserve continuous spatiotemporal structure, and latent world reasoning is hard to use as a direct condition for action generation. This work proposes CoWorld-VLA, a multi-expert world reasoning framework that introduces four types of expert tokens (semantic interaction, geometric structure, dynamic evolution, and ego trajectory) to extract complementary world knowledge through multi-source supervision and encode it as planner-accessible conditions. A diffusion-based hierarchical multi-expert fusion planner, coupled with scene context, then generates continuous ego trajectories through a joint denoising process. On the NAVSIM v1 benchmark, CoWorld-VLA achieves competitive results in future scene generation and planning, with strong collision avoidance and trajectory accuracy; ablation studies support the complementarity of the expert tokens and their effectiveness as planning conditions.

📝 Abstract

Vision-Language-Action (VLA) models have emerged as a promising paradigm for end-to-end autonomous driving. However, existing reasoning mechanisms still struggle to provide planning-oriented intermediate representations: textual Chain-of-Thought (CoT) fails to preserve continuous spatiotemporal structure, while latent world reasoning remains difficult to use as a direct condition for action generation. In this paper, we propose CoWorld-VLA, a multi-expert world reasoning framework for autonomous driving, where world representations serve as explicit conditions to guide action planning. CoWorld-VLA extracts complementary world information through multi-source supervision and encodes it into expert tokens within the VLA, thereby providing planner-accessible conditioning signals. Specifically, we construct four types of tokens: semantic interaction, geometric structure, dynamic evolution, and ego trajectory tokens, which respectively model interaction intent, spatial structure, future temporal dynamics, and behavioral goals. During action generation, CoWorld-VLA employs a diffusion-based hierarchical multi-expert fusion planner, which is coupled with scene context throughout the joint denoising process to generate continuous ego trajectories. Experiments show that CoWorld-VLA achieves competitive results in both future scene generation and planning on the NAVSIM v1 benchmark, demonstrating strong performance in collision avoidance and trajectory accuracy. Ablation studies further validate the complementarity of expert tokens and their effectiveness as planning conditions for action generation. Code will be available at https://github.com/potatochip1211/CoWorld-VLA.
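To make the idea of "expert tokens as planning conditions for a diffusion planner" concrete, here is a minimal toy sketch. It is not the authors' implementation. Everything in it is an assumption made for illustration: the token dimension, the mean-pooling fusion in `fuse_expert_tokens`, the linear decoder `proj`, the noise schedule, and the use of deterministic DDIM-style sampling. In particular, `toy_noise_predictor` is a hand-written stand-in for what would be a learned network.

```python
import numpy as np

# Names of the four expert tokens described in the abstract.
EXPERT_NAMES = ("semantic_interaction", "geometric_structure",
                "dynamic_evolution", "ego_trajectory")


def fuse_expert_tokens(tokens):
    """Toy fusion (assumption): average the four token vectors into one condition."""
    stacked = np.stack([tokens[name] for name in EXPERT_NAMES])  # shape (4, dim)
    return stacked.mean(axis=0)                                  # shape (dim,)


def decode_target(cond, proj, horizon):
    """Hypothetical linear decoder from the condition to (x, y) waypoints."""
    return (proj @ cond).reshape(horizon, 2)


def toy_noise_predictor(x_t, alpha_bar_t, cond, proj):
    """Stand-in for a learned denoiser: infers the noise that separates x_t
    from the trajectory implied by the conditioning vector."""
    x0_hat = decode_target(cond, proj, x_t.shape[0])
    return (x_t - np.sqrt(alpha_bar_t) * x0_hat) / np.sqrt(1.0 - alpha_bar_t)


def sample_trajectory(tokens, proj, horizon=8, steps=20, seed=0):
    """Deterministic DDIM-style reverse process conditioned on the fused tokens."""
    rng = np.random.default_rng(seed)
    betas = np.linspace(1e-4, 0.2, steps)   # assumed linear noise schedule
    alpha_bars = np.cumprod(1.0 - betas)
    cond = fuse_expert_tokens(tokens)
    x = rng.standard_normal((horizon, 2))   # start from pure Gaussian noise
    for t in range(steps - 1, -1, -1):
        a_t = alpha_bars[t]
        a_prev = alpha_bars[t - 1] if t > 0 else 1.0
        eps = toy_noise_predictor(x, a_t, cond, proj)
        x0_pred = (x - np.sqrt(1.0 - a_t) * eps) / np.sqrt(a_t)
        x = np.sqrt(a_prev) * x0_pred + np.sqrt(1.0 - a_prev) * eps
    return x


# Usage with random placeholder tokens (no real model or data involved).
rng = np.random.default_rng(1)
dim, horizon = 16, 8
tokens = {name: rng.standard_normal(dim) for name in EXPERT_NAMES}
proj = rng.standard_normal((horizon * 2, dim))
traj = sample_trajectory(tokens, proj, horizon=horizon)
print(traj.shape)
```

Because the stand-in denoiser is an oracle with respect to the conditioning vector, the sampled trajectory collapses onto the waypoints decoded from the fused tokens. That is the point of the toy: the tokens alone determine where the denoising process ends up. A real planner would replace the oracle with a trained network and, per the abstract, would fuse the tokens hierarchically together with scene context rather than by a simple mean.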

Problem

Research questions and friction points this paper is trying to address.

Vision-Language-Action

world model

autonomous driving

action planning

intermediate representation

Innovation

Methods, ideas, or system contributions that make the work stand out.

multi-expert world model

expert tokens

diffusion-based planning