MotuBrain: An Advanced World Action Model for Robot Control

📅 2026-04-30
📈 Citations: 0
Influential: 0

🤖 AI Summary
Existing vision-language-action (VLA) models generalize well semantically but struggle to capture fine-grained world dynamics. This work proposes MotuBrain, a unified multimodal generative model that combines the UniDiffuser formulation with a three-stream Mixture-of-Transformers architecture to model video and action sequences jointly. A single model supports several inference modes, including policy learning, world modeling, video generation, inverse dynamics, and joint video-action prediction, and it scales to heterogeneous multimodal data such as video-only and cross-embodiment robot data. For real-world use, MotuBrain adds a unified multiview representation, explicit language-action coupling, and an efficient inference stack that delivers an inference speedup of over 50x for real-time deployment. The authors report strong generalization and controllability across a wide range of tasks.
📝 Abstract
Vision-Language-Action (VLA) models achieve strong semantic generalization but often lack fine-grained modeling of world dynamics. Recent work explores video generation models as a foundation for world modeling, leading to unified World Action Models (WAMs) that jointly model visual dynamics and actions. We present MotuBrain, a unified multimodal generative model that jointly models video and action under a UniDiffuser formulation with a three-stream Mixture-of-Transformers architecture. A single model supports multiple inference modes, including policy learning, world modeling, video generation, inverse dynamics, and joint video-action prediction, while scaling to heterogeneous multimodal data such as video-only and cross-embodiment robot data. To improve real-world applicability, MotuBrain introduces a unified multiview representation, explicit language-action coupling, and an efficient inference stack, achieving over 50x speedup for real-time deployment.
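The way one model can serve all five inference modes is easiest to see through the general UniDiffuser idea of giving each modality its own noise level. A clean input (level 0) acts as a condition, a fully noised input (maximum level) is effectively marginalized out, and a shared intermediate level means both modalities are denoised together. The Python sketch below illustrates that convention only. The mode names, the `T_MAX` value, and the `mode_timesteps` helper are hypothetical, and the abstract does not state that MotuBrain assigns timesteps exactly this way.

```python
from dataclasses import dataclass

# Assumed number of diffusion steps; purely illustrative, not from the paper.
T_MAX = 1000


@dataclass(frozen=True)
class Timesteps:
    """Noise level fed to the network for each modality at one denoising step."""
    video: int
    action: int


def mode_timesteps(mode: str, t: int) -> Timesteps:
    """Map a (hypothetical) inference mode to per-modality noise levels.

    t is the current noise level of whichever modality is being generated.
    Level 0 marks a clean, given input; T_MAX marks an ignored (marginalized) one.
    """
    if not 0 <= t <= T_MAX:
        raise ValueError(f"t must be in [0, {T_MAX}], got {t}")

    table = {
        # Generate actions; future video is marginalized out.
        "policy": Timesteps(video=T_MAX, action=t),
        # Generate future video conditioned on clean, given actions.
        "world_model": Timesteps(video=t, action=0),
        # Generate future video; actions are marginalized out.
        "video_generation": Timesteps(video=t, action=T_MAX),
        # Generate actions conditioned on clean, given video.
        "inverse_dynamics": Timesteps(video=0, action=t),
        # Denoise video and actions together at the same level.
        "joint": Timesteps(video=t, action=t),
    }
    if mode not in table:
        raise ValueError(f"unknown mode: {mode!r}")
    return table[mode]


if __name__ == "__main__":
    for name in ("policy", "world_model", "video_generation",
                 "inverse_dynamics", "joint"):
        print(name, mode_timesteps(name, 400))
```

Under this convention, switching modes changes only the timestep inputs and leaves the network weights untouched, which is one plausible reading of "a single model supports multiple inference modes". Training on video-only data could then correspond to always treating the action stream as fully noised, though that too is an assumption rather than something the abstract confirms.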
Problem

Research questions and friction points this paper is trying to address.

Vision-Language-Action models
world dynamics modeling
fine-grained modeling
robot control
multimodal generative model
Innovation

Methods, ideas, or system contributions that make the work stand out.

World Action Model
UniDiffuser
Mixture-of-Transformers
multimodal generative model
real-time robot control