Mask World Model: Predicting What Matters for Robust Robot Policy Learning

📅 2026-04-21
📈 Citations: 0
✨ Influential: 0

🤖 AI Summary
Existing RGB video–based world models are highly sensitive to irrelevant visual factors such as dynamic backgrounds and illumination changes, leading to poor policy generalization and fragile control. To address this, this work proposes the Mask World Model (MWM), which introduces semantic mask prediction into world modeling for the first time. Instead of predicting raw pixels, MWM leverages a video diffusion architecture to forecast semantic masks and employs a geometric information bottleneck to focus on essential physical dynamics and contact relationships. An integrated diffusion-based policy head enables end-to-end robust control. The method substantially outperforms current RGB-based world models on the LIBERO and RLBench simulation benchmarks and demonstrates strong generalization and robustness in real-world robotic experiments as well as under random token pruning stress tests.

πŸ“ Abstract
World models derived from large-scale video generative pre-training have emerged as a promising paradigm for generalist robot policy learning. However, standard approaches often focus on high-fidelity RGB video prediction, which can result in overfitting to irrelevant factors such as dynamic backgrounds and illumination changes. These distractions reduce the model's ability to generalize, ultimately leading to unreliable and fragile control policies. To address this, we introduce the Mask World Model (MWM), which leverages video diffusion architectures to predict the evolution of semantic masks instead of pixels. This shift imposes a geometric information bottleneck, forcing the model to capture essential physical dynamics and contact relations while filtering out visual noise. We seamlessly integrate this mask dynamics backbone with a diffusion-based policy head to enable robust end-to-end control. Extensive evaluations on the LIBERO and RLBench simulation benchmarks show that MWM significantly outperforms state-of-the-art RGB-based world models. Furthermore, real-world experiments and a robustness evaluation (via random token pruning) reveal that MWM generalizes better and is resilient to texture information loss.
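The random token pruning stress test mentioned in the abstract can be pictured with a minimal sketch. The abstract does not specify the pruning ratios, which tokens are dropped, or where in the pipeline pruning is applied, so everything below is an assumption for illustration only: the function name `random_token_pruning`, the `keep_ratio` and `seed` parameters, and the string "patch" tokens are all hypothetical and are not the authors' implementation.

```python
import random
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def random_token_pruning(tokens: Sequence[T], keep_ratio: float, seed: int = 0) -> List[T]:
    """Keep a uniformly random subset of tokens, preserving their original order.

    Hypothetical helper: the paper's actual pruning protocol is not described here.
    """
    if not 0.0 < keep_ratio <= 1.0:
        raise ValueError("keep_ratio must be in (0, 1]")
    if len(tokens) == 0:
        return []
    # Always keep at least one token so downstream code has some input.
    n_keep = max(1, round(len(tokens) * keep_ratio))
    rng = random.Random(seed)  # local RNG so the result is reproducible
    kept_idx = sorted(rng.sample(range(len(tokens)), n_keep))
    return [tokens[i] for i in kept_idx]


# Toy stand-in for a sequence of visual tokens (assumption: 16 patches).
tokens = [f"patch_{i}" for i in range(16)]
pruned = random_token_pruning(tokens, keep_ratio=0.5, seed=0)
```

In a stress test of this kind, a policy would be evaluated repeatedly at decreasing `keep_ratio` values, and a robust model would be expected to degrade gradually as more tokens are removed.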
Problem

Research questions and friction points this paper is trying to address.

world model
robot policy learning
generalization
visual distractions
robustness
Innovation

Methods, ideas, or system contributions that make the work stand out.

Mask World Model
semantic masks
video diffusion
information bottleneck
robust policy learning
Yunfan Lou
National University of Singapore, Singapore
Xiaowei Chi
The Hong Kong University of Science and Technology
Multimodal Generation, Robotics, Computer Vision
Xiaojie Zhang
City University of New York, The Graduate Center
Edge computing
Zezhong Qian
Xi'an Jiaotong University
World Model, Autonomous Driving, Video Generation, Robot Manipulation
Chengxuan Li
State Key Laboratory of Multimedia Information Processing, School of Computer Science, Peking University
Rongyu Zhang
Nanjing University, Nanjing, China
Yaoxu Lyu
Beijing Academy of Artificial Intelligence, Beijing, China
Guoyu Song
Peking University, Beijing, China
Chuyao Fu
Beijing Academy of Artificial Intelligence, Beijing, China
Haoxuan Xu
Beihang University
computer vision
Pengwei Wang
University of Calgary
Computer Science Security
Shanghang Zhang
Peking University
Embodied AI, Foundation Models