🤖 AI Summary
This work addresses the challenge of efficiently leveraging spatiotemporal priors from pretrained video models for robotic visuomotor control without resorting to complex multi-stage fine-tuning or architectural modifications. The authors propose a single-stage fine-tuning approach that directly repurposes the pretrained video diffusion model Cosmos-Predict2 into an end-to-end policy, preserving its original architecture. By harnessing its latent diffusion process, the model simultaneously generates actions, future state images, and value predictions, enabling trajectory planning at test time. Notably, this is the first method to encode actions and value estimates as latent video frames without altering the model structure, thereby integrating planning and continual world model refinement within the same framework. The approach achieves state-of-the-art results with 98.5% and 67.1% average success rates on the LIBERO and RoboCasa simulation benchmarks, respectively, and outperforms both from-scratch diffusion policies and leading vision-language-action models in real-world dual-arm tasks.
📝 Abstract
Recent video generation models demonstrate a remarkable ability to capture complex physical interactions and scene evolution over time. To leverage their spatiotemporal priors, prior robotics works have adapted video models for policy learning but introduce complexity by requiring multiple stages of post-training and new architectural components for action generation. In this work, we introduce Cosmos Policy, a simple approach for adapting a large pretrained video model (Cosmos-Predict2) into an effective robot policy through a single stage of post-training on robot demonstration data collected on the target platform, with no architectural modifications. Cosmos Policy learns to directly generate robot actions encoded as latent frames within the video model's latent diffusion process, harnessing the model's pretrained priors and core learning algorithm to capture complex action distributions. Additionally, Cosmos Policy generates future state images and values (expected cumulative rewards), which are similarly encoded as latent frames, enabling test-time planning of action trajectories with a higher likelihood of success. In our evaluations, Cosmos Policy achieves state-of-the-art performance on the LIBERO and RoboCasa simulation benchmarks (98.5% and 67.1% average success rates, respectively) and the highest average score in challenging real-world bimanual manipulation tasks, outperforming strong diffusion policies trained from scratch, video model-based policies, and state-of-the-art vision-language-action models fine-tuned on the same robot demonstrations. Furthermore, given policy rollout data, Cosmos Policy can learn from experience to refine its world model and value function and leverage model-based planning to achieve even higher success rates in challenging tasks. We release code, models, and training data at https://research.nvidia.com/labs/dir/cosmos-policy/
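To make the test-time planning idea concrete, here is a minimal conceptual sketch in Python. It is not the authors' implementation: `sample_latent_frames`, its shapes, and the candidate count are all hypothetical stand-ins for the model's joint denoising of state, action, and value latent frames; the sketch only illustrates sampling several candidate trajectories and executing the one with the highest predicted value.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_latent_frames(observation, horizon=8, latent_dim=16):
    """Hypothetical stand-in for the video model's reverse diffusion pass.

    In Cosmos Policy, one latent diffusion process jointly generates latent
    frames for future state images, actions, and a value estimate (expected
    cumulative reward). Here we draw random latents just to show the interface.
    """
    future_states = rng.normal(size=(horizon, latent_dim))  # predicted state latents
    actions = rng.normal(size=(horizon, latent_dim))        # action latents to decode
    value = float(rng.normal())                             # decoded value estimate
    return future_states, actions, value

def plan(observation, num_candidates=8):
    """Test-time planning: sample candidate action trajectories and keep the
    one whose predicted value (expected return) is highest."""
    best_actions, best_value = None, -np.inf
    for _ in range(num_candidates):
        _, actions, value = sample_latent_frames(observation)
        if value > best_value:
            best_actions, best_value = actions, value
    return best_actions, best_value

actions, value = plan(observation=None)
print(actions.shape)
```

Because the world model and value function are refined from rollout data, this selection step can improve over the policy's unconditional samples without any architectural changes.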