Qwen-VLA: Unifying Vision-Language-Action Modeling across Tasks, Environments, and Robot Embodiments

📅 2026-05-28

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the limited generalization of embodied intelligence models, which are typically specialized for individual tasks such as manipulation or navigation. The authors propose Qwen-VLA, a unified vision-language-action foundation model that casts manipulation, navigation, and trajectory prediction into a single action-and-trajectory prediction framework. It combines embodiment-aware prompt conditioning, in which robot-specific textual descriptions specify the current embodiment and control convention, with a DiT-based action decoder. The model is trained jointly on large-scale heterogeneous data, including robot manipulation trajectories, human egocentric demonstrations, synthetic simulation data, vision-and-language navigation data, trajectory-centric supervision, and auxiliary vision-language data. Qwen-VLA-Instruct reports 97.9% on LIBERO, 73.7% on Simpler-WidowX, 86.1%/87.2% on RoboTwin-Easy/Hard, 69.0% OSR on R2R, 59.6% SR on RxR, 76.9% average OOD success in real-world ALOHA experiments, and 26.6% zero-shot success on DOMINO dynamic manipulation.

📝 Abstract

Embodied intelligence is often studied through specialized models for individual tasks such as manipulation or navigation, resulting in fragmented capabilities and limited generalization across tasks, environments, and robot embodiments. In this work, we study whether heterogeneous embodied decision-making problems can be unified within a single vision-language-action model. We present Qwen-VLA, a unified embodied foundation model that extends Qwen's vision-language modeling stack from perception, understanding, and reasoning to continuous action and trajectory generation through a DiT-based action decoder. Qwen-VLA is trained with a large-scale joint pretraining recipe over diverse data sources, including robotics manipulation trajectories, human egocentric demonstrations, synthetic simulation data, vision-and-language navigation data, trajectory-centric supervision, and auxiliary vision-language data. To support multiple robot platforms, we introduce embodiment-aware prompt conditioning, where robot-specific textual descriptions specify the current embodiment and control convention. We further cast manipulation, navigation, and trajectory prediction into a unified action-and-trajectory prediction framework, enabling transferable visual grounding, spatial reasoning, and continuous action generation across robot morphologies, task families, and environments. Experiments on manipulation, navigation, and trajectory-centric benchmarks show consistent multi-task performance and out-of-distribution generalization under variations in scene layout, background, lighting, object configuration, and robot embodiment. Qwen-VLA-Instruct achieves 97.9% on LIBERO, 73.7% on Simpler-WidowX, 86.1%/87.2% on RoboTwin-Easy/Hard, 69.0% OSR on R2R, 59.6% SR on RxR, 76.9% average OOD success in real-world ALOHA experiments, and 26.6% zero-shot success on DOMINO dynamic manipulation.
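
The embodiment-aware prompt conditioning described in the abstract can be pictured with a minimal sketch. The abstract only states that robot-specific textual descriptions specify the embodiment and control convention; the `Embodiment` class, the `build_prompt` helper, the prompt wording, and the example action dimensions below are all hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Embodiment:
    """Hypothetical description of a robot platform (not from the paper)."""
    name: str
    action_dim: int
    control_convention: str


def build_prompt(embodiment: Embodiment, instruction: str) -> str:
    """Prefix the task instruction with a textual embodiment description."""
    header = (
        f"Robot: {embodiment.name}. "
        f"Action space: {embodiment.action_dim}-D, {embodiment.control_convention}."
    )
    return f"{header}\nTask: {instruction.strip()}"


# Illustrative embodiments; names, dimensions, and conventions are assumptions.
WIDOWX = Embodiment("WidowX single arm", 7, "end-effector delta pose plus gripper")
ALOHA = Embodiment("ALOHA bimanual", 14, "absolute joint positions")

if __name__ == "__main__":
    # The same instruction yields a different conditioning prompt per robot.
    print(build_prompt(WIDOWX, "put the spoon on the towel"))
    print(build_prompt(ALOHA, "put the spoon on the towel"))
```

In this sketch the same language instruction produces a different text prefix for each platform, which is one plausible way a single model could be told which control convention its action outputs should follow.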

Problem

Research questions and friction points this paper is trying to address.

embodied intelligence

task generalization

robot embodiment

vision-language-action

cross-environment generalization

Innovation

Methods, ideas, or system contributions that make the work stand out.

vision-language-action modeling

embodied foundation model

DiT-based action decoder