🤖 AI Summary
Existing approaches decouple perception, world modeling, and control into separate models, resulting in fragmented multimodal generative capabilities and hindering joint learning from large-scale heterogeneous data. To address this, we propose Motus, a unified latent action world model that achieves, for the first time, end-to-end joint modeling of visual understanding, video generation, and action control. Methodologically, Motus employs a Mixture-of-Transformer (MoT) architecture coupled with a UniDiffuser-style scheduler, introduces optical-flow-driven pixel-level "delta action" representations, adopts a three-phase progressive training strategy, and leverages a six-layer data pyramid to enable scalable action pretraining. Evaluated on both simulation and real-world robotic tasks, Motus outperforms X-VLA by 15% and Pi0.5 by 45% on benchmark tasks, and achieves 11–48% improvements on real-world manipulation tasks, demonstrating substantial gains over state-of-the-art methods.
📝 Abstract
While a general embodied agent must function as a unified system, current methods are built on isolated models for understanding, world modeling, and control. This fragmentation prevents unifying multimodal generative capabilities and hinders learning from large-scale, heterogeneous data. In this paper, we propose Motus, a unified latent action world model that leverages existing general pretrained models and rich, sharable motion information. Motus introduces a Mixture-of-Transformer (MoT) architecture to integrate three experts (i.e., understanding, video generation, and action) and adopts a UniDiffuser-style scheduler to enable flexible switching between different modeling modes (i.e., world models, vision-language-action models, inverse dynamics models, video generation models, and video-action joint prediction models). Motus further leverages optical flow to learn latent actions and adopts a training recipe with a three-phase pipeline and a six-layer data pyramid, thereby extracting pixel-level "delta actions" and enabling large-scale action pretraining. Experiments show that Motus achieves superior performance against state-of-the-art methods in both simulation (a +15% improvement over X-VLA and a +45% improvement over Pi0.5) and real-world scenarios (+11–48% improvement), demonstrating that unified modeling of all functionalities and priors significantly benefits downstream robotic tasks.
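The UniDiffuser-style mode switching described in the abstract can be sketched as follows: each modality gets its own diffusion timestep, and setting a modality's timestep to 0 treats it as a clean condition while a sampled timestep marks it as a generation target. The mode table, modality names, and function below are illustrative assumptions for exposition, not the paper's actual configuration or released code.

```python
import numpy as np

# Modalities handled by the three experts in a Motus-style MoT
# (names here are illustrative, not taken from the paper's implementation).
MODALITIES = ("language", "video", "action")

# Assumed mode table: "cond" = clean input (timestep 0), "gen" = diffusion
# target (sampled timestep). At this level of abstraction, the VLA and
# inverse-dynamics modes differ only in which video frames they condition on.
MODES = {
    "world_model":      {"language": "cond", "video": "gen",  "action": "cond"},
    "vla":              {"language": "cond", "video": "cond", "action": "gen"},
    "inverse_dynamics": {"language": "cond", "video": "cond", "action": "gen"},
    "video_generation": {"language": "cond", "video": "gen",  "action": "cond"},
    "joint_prediction": {"language": "cond", "video": "gen",  "action": "gen"},
}

def sample_timesteps(mode, rng, T=1000):
    """Assign a per-modality diffusion timestep for one training step.

    Conditioned modalities get timestep 0 (no noise); generated modalities
    get a timestep drawn uniformly from [1, T), so a single network trained
    this way can be queried in any of the modes above at inference time.
    """
    roles = MODES[mode]
    return {m: (0 if roles[m] == "cond" else int(rng.integers(1, T)))
            for m in MODALITIES}
```

Under this scheme, switching between a world model (predict video from language and actions) and a VLA policy (predict actions from language and video) is purely a matter of which timestep assignment is used; no architectural change is needed.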