Motus: A Unified Latent Action World Model

📅 2025-12-15
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing approaches decouple perception, world modeling, and control into separate models, resulting in fragmented multimodal generative capabilities and hindering joint learning from large-scale heterogeneous data. To address this, we propose Motus, a unified latent action world model that achieves, for the first time, end-to-end joint modeling of visual understanding, video generation, and action control. Methodologically, Motus employs a Mixture-of-Transformer (MoT) architecture coupled with a UniDiffuser-style scheduler, introduces optical-flow-driven pixel-level "delta action" representations, adopts a three-phase progressive training strategy, and leverages a six-layer data pyramid to enable scalable action pretraining. Evaluated on both simulated and real-world robotic tasks, Motus outperforms X-VLA by 15% and Pi0.5 by 45% on benchmark tasks, and achieves 11–48% improvements on real-scene manipulation, demonstrating substantial gains over state-of-the-art methods.

📝 Abstract
While a general embodied agent must function as a unified system, current methods are built on isolated models for understanding, world modeling, and control. This fragmentation prevents unifying multimodal generative capabilities and hinders learning from large-scale, heterogeneous data. In this paper, we propose Motus, a unified latent action world model that leverages existing general pretrained models and rich, sharable motion information. Motus introduces a Mixture-of-Transformer (MoT) architecture to integrate three experts (i.e., understanding, video generation, and action) and adopts a UniDiffuser-style scheduler to enable flexible switching between different modeling modes (i.e., world models, vision-language-action models, inverse dynamics models, video generation models, and video-action joint prediction models). Motus further leverages optical flow to learn latent actions and adopts a recipe with a three-phase training pipeline and a six-layer data pyramid, thereby extracting pixel-level "delta actions" and enabling large-scale action pretraining. Experiments show that Motus achieves superior performance over state-of-the-art methods in both simulation (a +15% improvement over X-VLA and a +45% improvement over Pi0.5) and real-world scenarios (improvements of +11–48%), demonstrating that unified modeling of all functionalities and priors significantly benefits downstream robotic tasks.
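To make the mode switching concrete: in a UniDiffuser-style setup, each modality carries its own diffusion timestep, and setting a modality's timestep to zero treats it as clean conditioning while a noised modality is generated by denoising. The PyTorch sketch below illustrates this idea under stated assumptions; the mode table, the function name `sample_timesteps`, and the choice to tie the noised modalities to a shared timestep are illustrative guesses, not the paper's implementation.

```python
import torch

# UniDiffuser-style per-modality timesteps: t = 0 marks a modality as clean
# conditioning; t > 0 marks it as noised, i.e. generated by denoising.
# This mode table is an illustrative assumption, not the paper's code.
MODES = {
    "world_model":      {"video": True,  "action": False},  # actions given, future video generated
    "vla":              {"video": False, "action": True},   # observation + language given, actions generated
    "inverse_dynamics": {"video": False, "action": True},   # full video given, actions recovered
    "video_generation": {"video": True,  "action": False},  # language-conditioned video generation
    "joint_prediction": {"video": True,  "action": True},   # denoise video and actions together
}

def sample_timesteps(mode: str, batch: int, T: int = 1000):
    """Return (t_video, t_action) for one training step of the given mode."""
    noised = MODES[mode]
    t = torch.randint(1, T + 1, (batch,))          # shared noise level for generated modalities
    zeros = torch.zeros(batch, dtype=torch.long)   # clean conditioning
    t_video = t if noised["video"] else zeros
    t_action = t if noised["action"] else zeros
    return t_video, t_action

t_v, t_a = sample_timesteps("inverse_dynamics", batch=8)
```

Note that "vla" and "inverse_dynamics" share the same timestep pattern here; in practice they would differ in which visual frames are provided as conditioning, a detail this two-modality sketch elides.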
Problem

Research questions and friction points this paper is trying to address.

Unifying multimodal generative capabilities in embodied agents
Integrating understanding, world modeling, and control into a single system
Enabling learning from large-scale heterogeneous data for robotics
Innovation

Methods, ideas, or system contributions that make the work stand out.

Mixture-of-Transformer (MoT) architecture integrates understanding, video-generation, and action experts
UniDiffuser-style scheduler enables flexible switching among modeling modes
Optical flow yields pixel-level "delta action" latents for large-scale action pretraining (see the sketch after this list)
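A minimal sketch of how a pixel-level "delta action" could be distilled from optical flow, assuming an external flow estimator (e.g., an off-the-shelf RAFT model) feeding a small encoder; the layer sizes, names, and latent dimension below are illustrative guesses, not the paper's design.

```python
import torch
import torch.nn as nn

class DeltaActionEncoder(nn.Module):
    """Compress a dense 2-channel optical-flow field (dx, dy per pixel)
    between consecutive frames into a compact latent 'delta action'.
    Purely illustrative: the flow itself comes from an external estimator."""

    def __init__(self, latent_dim: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(2, 32, kernel_size=4, stride=2, padding=1),  # downsample flow
            nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),   # global average pool to one vector per clip
            nn.Flatten(),
            nn.Linear(64, latent_dim),
        )

    def forward(self, flow: torch.Tensor) -> torch.Tensor:
        # flow: (B, 2, H, W) -> latent delta action: (B, latent_dim)
        return self.net(flow)

z = DeltaActionEncoder()(torch.randn(4, 2, 128, 128))  # dummy flow batch
# In practice: flow = raft_model(frame_t, frame_t1)    # hypothetical external flow model
```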
Hongzhe Bi
Dept. of Comp. Sci. and Tech., Institute for AI, BNRist Center, THBI Lab, Tsinghua-Bosch Joint ML Center, Tsinghua University
Hengkai Tan
Tsinghua University
Reinforcement Learning, Robot Learning, Embodied AI, Deep Generative Models
Shenghao Xie
Ph.D. Student, AAIS, PKU
Computer Vision, Machine Learning
Zeyuan Wang
PhD, The University of Sydney
NLP, Medical Informatics
Shuhe Huang
Dept. of Comp. Sci. and Tech., Institute for AI, BNRist Center, THBI Lab, Tsinghua-Bosch Joint ML Center, Tsinghua University
Haitian Liu
Dept. of Comp. Sci. and Tech., Institute for AI, BNRist Center, THBI Lab, Tsinghua-Bosch Joint ML Center, Tsinghua University
Ruowen Zhao
Tsinghua University
3D Vision, Generative Models
Yao Feng
Dept. of Comp. Sci. and Tech., Institute for AI, BNRist Center, THBI Lab, Tsinghua-Bosch Joint ML Center, Tsinghua University
Chendong Xiang
First-year Ph.D. student in Computer Science and Technology, Tsinghua University
Generative Models, Embodied AI
Yinze Rong
Dept. of Comp. Sci. and Tech., Institute for AI, BNRist Center, THBI Lab, Tsinghua-Bosch Joint ML Center, Tsinghua University
Hongyan Zhao
Dept. of Comp. Sci. and Tech., Institute for AI, BNRist Center, THBI Lab, Tsinghua-Bosch Joint ML Center, Tsinghua University
Hanyu Liu
Key Laboratory of Material Simulation Methods and Software of MOE, Jilin University
Computational Science, High Pressure
Zhizhong Su
Horizon Robotics
Deep Learning, Computer Vision, Autonomous Driving, Robotics Learning
Lei Ma
Peking University
Hang Su
Dept. of Comp. Sci. and Tech., Institute for AI, BNRist Center, THBI Lab, Tsinghua-Bosch Joint ML Center, Tsinghua University
Jun Zhu
Dept. of Comp. Sci. and Tech., Institute for AI, BNRist Center, THBI Lab, Tsinghua-Bosch Joint ML Center, Tsinghua University