Do World Action Models Generalize Better than VLAs? A Robustness Study

📅 2026-03-23
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work presents the first systematic comparison of World Action Models (WAMs) and Vision-Language-Action (VLA) models in terms of robustness and generalization to unseen environments, evaluating them on the LIBERO-Plus and RoboTwin 2.0-Plus benchmarks. Using large-scale video-pretrained WAMs such as Cosmos-Policy and LingBot-VA, alongside state-of-the-art VLA policies like π₀.₅, the study conducts end-to-end assessments under visual and language perturbations. Results show that WAMs substantially outperform most VLAs, which are comparatively data-hungry: LingBot-VA achieves a 74.2% success rate on RoboTwin 2.0-Plus and Cosmos-Policy reaches 82.2% on LIBERO-Plus. These findings underscore the role of video-pretrained dynamic priors in improving generalization for real-world robotic manipulation.

📝 Abstract
Robot action planning in the real world is challenging as it requires not only understanding the current state of the environment but also predicting how it will evolve in response to actions. Vision-language-action (VLA) models, which repurpose large-scale vision-language models for robot action generation via action experts, have achieved notable success across a variety of robotic tasks. Nevertheless, their performance remains constrained by the scope of their training data, exhibiting limited generalization to unseen scenarios and vulnerability to diverse contextual perturbations. More recently, world models have been revisited as an alternative to VLAs. These models, referred to as world action models (WAMs), are built upon world models trained on large corpora of video data to predict future states; with minor adaptations, their latent representations can be decoded into robot actions. It has been suggested that their explicit dynamic prediction capacity, combined with spatiotemporal priors acquired from web-scale video pretraining, enables WAMs to generalize more effectively than VLAs. In this paper, we conduct a comparative study of prominent state-of-the-art VLA policies and recently released WAMs. We evaluate their performance on the LIBERO-Plus and RoboTwin 2.0-Plus benchmarks under various visual and language perturbations. Our results show that WAMs achieve strong robustness, with LingBot-VA reaching a 74.2% success rate on RoboTwin 2.0-Plus and Cosmos-Policy achieving 82.2% on LIBERO-Plus. While VLAs such as $π_{0.5}$ can achieve comparable robustness on certain tasks, they typically require extensive training with diverse robotic datasets and varied learning objectives. Hybrid approaches that partially incorporate video-based dynamic learning exhibit intermediate robustness, highlighting the importance of how video priors are integrated.
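For intuition, the evaluation protocol described in the abstract amounts to averaging binary task success over episodes for each perturbation type. Below is a minimal Python sketch of such a loop; `make_env`, `policy.act`, the `env.step` return signature, and the perturbation names are illustrative assumptions, not the actual APIs of LIBERO-Plus or RoboTwin 2.0-Plus.

```python
# Hypothetical sketch of a perturbation-robustness evaluation: roll out a
# policy on each task under each perturbation type and report per-perturbation
# success rates. All environment/policy interfaces here are placeholders.
from collections import defaultdict

# Example visual and language perturbation types (illustrative only).
PERTURBATIONS = ["none", "camera_shift", "lighting",
                 "distractor_objects", "language_rephrase"]

def evaluate(policy, make_env, tasks, episodes_per_task=10, max_steps=300):
    """Return {perturbation: success_rate} averaged over tasks and episodes."""
    successes = defaultdict(int)
    trials = defaultdict(int)
    for perturb in PERTURBATIONS:
        for task in tasks:
            for ep in range(episodes_per_task):
                # Assumed factory: builds a perturbed instance of the task.
                env = make_env(task, perturbation=perturb, seed=ep)
                obs, instruction = env.reset()
                success = False
                for _ in range(max_steps):
                    # Assumed policy interface: observation + instruction -> action.
                    action = policy.act(obs, instruction)
                    obs, done, success = env.step(action)
                    if done:
                        break
                successes[perturb] += int(success)
                trials[perturb] += 1
    return {p: successes[p] / trials[p] for p in PERTURBATIONS}
```

Aggregating the returned per-perturbation rates over all tasks yields a single benchmark-level number like the 74.2% and 82.2% success rates reported above.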
Problem

Research questions and friction points this paper is trying to address.

robot action planning
generalization
robustness
vision-language-action models
world action models
Innovation

Methods, ideas, or system contributions that make the work stand out.

World Action Models
Vision-Language-Action Models
Robustness Evaluation
Video Pretraining
Dynamic Prediction
Zhanguang Zhang
Huawei Technologies
Zhiyuan Li
Huawei Technologies, University of Toronto
Behnam Rahmati
Huawei Technologies
Rui Heng Yang
University of Toronto
Computer Vision, Robotics, Neural Network Acceleration, Model Compression
Yintao Ma
Huawei Technologies
Amir Rasouli
Noah's Ark Laboratory
Robotics, Computer Vision, Autonomous Driving, Visual Attention
Sajjad Pakdamansavoji
Huawei Technologies
Yangzheng Wu
Huawei Technologies
Lingfeng Zhang
PhD student at Tsinghua University
Embodied AI
Tongtong Cao
Researcher, Huawei Noah's Ark Lab
Robotics, Embodied AI, Autonomous Driving
Feng Wen
Huawei Technologies
Xingyue Quan
Huawei Technologies
Yingxue Zhang
Huawei
Graph Representation Learning, Graph Reasoning, LLMs Reasoning, Knowledge Graphs, Recommender Systems