🤖 AI Summary
Existing approaches decouple perception, world modeling, and control into separate models, resulting in fragmented multimodal generative capabilities and hindering joint learning from large-scale heterogeneous data. To address this, we propose Motus, a unified latent action world model that achieves, for the first time, end-to-end joint modeling of visual understanding, video generation, and action control. Methodologically, Motus employs a Mixture-of-Transformer (MoT) architecture coupled with a UniDiffuser-style scheduler, introduces optical-flow-driven pixel-level "delta action" representations, adopts a three-phase progressive training strategy, and leverages a six-layer data pyramid to enable scalable action pretraining. Evaluated on both simulation and real-world robotic tasks, Motus outperforms X-VLA by 15% and Pi0.5 by 45% on benchmark tasks, and achieves 11–48% improvements on real-world manipulation tasks, demonstrating substantial gains over state-of-the-art methods.
📝 Abstract
While a general embodied agent must function as a unified system, current methods are built on isolated models for understanding, world modeling, and control. This fragmentation prevents unifying multimodal generative capabilities and hinders learning from large-scale, heterogeneous data. In this paper, we propose Motus, a unified latent action world model that leverages existing general pretrained models and rich, sharable motion information. Motus introduces a Mixture-of-Transformer (MoT) architecture to integrate three experts (i.e., understanding, video generation, and action) and adopts a UniDiffuser-style scheduler to enable flexible switching between different modeling modes (i.e., world models, vision-language-action models, inverse dynamics models, video generation models, and video-action joint prediction models). Motus further leverages optical flow to learn latent actions and adopts a training recipe with a three-phase pipeline and a six-layer data pyramid, thereby extracting pixel-level "delta actions" and enabling large-scale action pretraining. Experiments show that Motus achieves superior performance against state-of-the-art methods in both simulation (a +15% improvement over X-VLA and a +45% improvement over Pi0.5) and real-world scenarios (+11–48% improvement), demonstrating that unified modeling of all functionalities and priors significantly benefits downstream robotic tasks.
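The UniDiffuser-style mode switching described in the abstract can be sketched as follows: each modality gets its own diffusion timestep, and setting a modality's timestep to 0 treats it as a clean condition while a sampled timestep marks it as a generation target. The mode table, modality names, and function below are illustrative assumptions for exposition, not the paper's actual configuration or released code.

```python
import numpy as np

# Modalities handled by the three experts in a Motus-style MoT
# (names here are illustrative, not taken from the paper's implementation).
MODALITIES = ("language", "video", "action")

# Assumed mode table: "cond" = clean input (timestep 0), "gen" = diffusion
# target (sampled timestep). At this level of abstraction, the VLA and
# inverse-dynamics modes differ only in which video frames they condition on.
MODES = {
    "world_model":      {"language": "cond", "video": "gen",  "action": "cond"},
    "vla":              {"language": "cond", "video": "cond", "action": "gen"},
    "inverse_dynamics": {"language": "cond", "video": "cond", "action": "gen"},
    "video_generation": {"language": "cond", "video": "gen",  "action": "cond"},
    "joint_prediction": {"language": "cond", "video": "gen",  "action": "gen"},
}

def sample_timesteps(mode, rng, T=1000):
    """Assign a per-modality diffusion timestep for one training step.

    Conditioned modalities get timestep 0 (no noise); generated modalities
    get a timestep drawn uniformly from [1, T), so a single network trained
    this way can be queried in any of the modes above at inference time.
    """
    roles = MODES[mode]
    return {m: (0 if roles[m] == "cond" else int(rng.integers(1, T)))
            for m in MODALITIES}
```

Under this scheme, switching between a world model (predict video from language and actions) and a VLA policy (predict actions from language and video) is purely a matter of which timestep assignment is used; no architectural change is needed.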