🤖 AI Summary
This work proposes an action-centric world-action model that addresses two weaknesses of existing approaches, in which visual and action representations are tightly coupled: high inference cost and a dependence of action accuracy on the quality of generated future video. By introducing a causal architecture that decouples action prediction from video generation, the model can make policy decisions without relying on rendered video outputs. It further incorporates visual-dynamics constraints and dual-task supervision signals to encourage physically plausible actions, while allowing video generation to be skipped during inference for faster decision-making. Leveraging a pretrained video backbone and large-scale robotic datasets, the method achieves 9× faster inference than Motus and a 7% higher task success rate on real robotic platforms, while also improving on pi-0.5 by 95% on RoboTwin 2.0.
📝 Abstract
World-Action Models (WAM) initialized from pre-trained video generation backbones have demonstrated remarkable potential for robot policy learning. However, existing approaches face two critical bottlenecks that hinder performance and deployment. First, jointly reasoning over future visual dynamics and the corresponding actions incurs substantial inference overhead. Second, joint modeling often entangles visual and motion representations, making motion prediction accuracy heavily dependent on the quality of future video forecasts. To address these issues, we introduce GigaWorld-Policy, an action-centered WAM that learns 2D pixel-action dynamics while enabling efficient action decoding, with optional video generation. Specifically, we formulate policy training as two coupled components: the model predicts future action sequences conditioned on the current observation, and simultaneously generates future videos conditioned on the predicted actions and the same observation. The policy is supervised by both action prediction and video generation, providing richer learning signals and encouraging physically plausible actions through visual-dynamics constraints. With a causal design that prevents future-video tokens from influencing action tokens, explicit future-video generation is optional at inference time, allowing faster action prediction during deployment. To support this paradigm, we curate a diverse, large-scale robot dataset to pre-train an action-centered video generation model, which is then adapted as the backbone for robot policy learning. Experimental results on real-world robotic platforms show that GigaWorld-Policy runs 9× faster than the leading WAM baseline, Motus, while improving task success rates by 7%. Moreover, compared with pi-0.5, GigaWorld-Policy improves performance by 95% on RoboTwin 2.0.
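The causal design described above can be illustrated with a small sketch. This is not the paper's implementation; it is a minimal, hypothetical attention-mask construction assuming the token sequence is laid out as [observation | action | future-video]. Under that ordering, a standard causal (lower-triangular) mask already guarantees that action tokens never attend to future-video tokens, which is exactly why the video block can be dropped at inference without changing the action predictions:

```python
import numpy as np

def build_causal_wam_mask(n_obs: int, n_act: int, n_vid: int) -> np.ndarray:
    """Boolean attention mask (True = row token may attend to column token)
    for a sequence laid out as [observation | action | future-video].

    Because video tokens come last, causal masking alone ensures that
    no action token ever attends to a future-video token.
    """
    n = n_obs + n_act + n_vid
    # Standard causal mask: each token attends to itself and earlier tokens.
    return np.tril(np.ones((n, n), dtype=bool))

# Toy sizes: 4 observation tokens, 3 action tokens, 5 video tokens.
mask = build_causal_wam_mask(n_obs=4, n_act=3, n_vid=5)

# Key invariant: action rows (indices 4..6) see no video columns (7..11),
# so action decoding is independent of the video block.
assert not mask[4:7, 7:].any()

# Skipping video generation at inference: the mask restricted to the
# [observation | action] prefix is identical to a mask built with no
# video tokens at all.
assert (mask[:7, :7] == build_causal_wam_mask(4, 3, 0)).all()
```

With this layout, deployment can simply truncate the sequence after the action block, which is the source of the reported inference speedup.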