World Simulation with Video Foundation Models for Physical AI

📅 2025-10-28
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the need for high-fidelity, long-horizon, instruction-controllable physical-world simulation in embodied intelligence, this paper introduces the Cosmos family of unified video foundation models. Methodologically: (1) we propose Cosmos-Predict2.5, a flow-based generative architecture that unifies Text2World, Image2World, and Video2World generation in a single model; (2) we integrate Cosmos-Reason1, a Physical AI vision-language model, with reinforcement learning-based post-training to achieve fine-grained semantic alignment; and (3) we design Cosmos-Transfer2.5, a lightweight control-net style framework for Sim2Real and Real2Real world translation. Trained on 200 million curated video clips, the Cosmos models are released at 2B and 14B parameter scales. Experiments demonstrate substantial improvements in spatiotemporal coherence, instruction following, and generation fidelity. The models enable synthetic data generation, policy evaluation, and closed-loop simulation. Code, pretrained checkpoints, and benchmark suites are publicly released to advance embodied AI research and deployment.
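
A flow-based world model of this kind is typically trained with a flow-matching objective: the network regresses the velocity that transports noise to data along a straight interpolation path. The sketch below shows one such training step in generic form; the `WorldModel` network, its dimensions, and the hyperparameters are illustrative assumptions, not the paper's implementation.

```python
# Minimal flow-matching training step for a conditional world model.
# Generic illustration of the technique, not the paper's code.
import torch
import torch.nn as nn

class WorldModel(nn.Module):
    """Toy velocity-field network: predicts the flow from noise to video latents."""
    def __init__(self, latent_dim: int = 64, cond_dim: int = 32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(latent_dim + cond_dim + 1, 256),
            nn.SiLU(),
            nn.Linear(256, latent_dim),
        )

    def forward(self, x_t, t, cond):
        # Concatenate noisy latent, conditioning embedding, and timestep.
        inp = torch.cat([x_t, cond, t.unsqueeze(-1)], dim=-1)
        return self.net(inp)

def flow_matching_step(model, x1, cond, optimizer):
    """One rectified-flow / flow-matching update: regress the velocity x1 - x0."""
    x0 = torch.randn_like(x1)                        # noise sample
    t = torch.rand(x1.shape[0])                      # uniform time in [0, 1]
    x_t = (1 - t[:, None]) * x0 + t[:, None] * x1    # straight interpolation path
    target = x1 - x0                                 # constant velocity along the path
    loss = torch.mean((model(x_t, t, cond) - target) ** 2)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

model = WorldModel()
opt = torch.optim.AdamW(model.parameters(), lr=1e-4)
x1 = torch.randn(8, 64)    # stand-in for clean video latents
cond = torch.randn(8, 32)  # stand-in for text/image/video conditioning embeddings
print(flow_matching_step(model, x1, cond, opt))
```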

📝 Abstract
We introduce Cosmos-Predict2.5, the latest generation of the Cosmos World Foundation Models for Physical AI. Built on a flow-based architecture, Cosmos-Predict2.5 unifies Text2World, Image2World, and Video2World generation in a single model and leverages Cosmos-Reason1, a Physical AI vision-language model, to provide richer text grounding and finer control of world simulation. Trained on 200M curated video clips and refined with reinforcement learning-based post-training, Cosmos-Predict2.5 achieves substantial improvements over Cosmos-Predict1 in video quality and instruction alignment, with models released at 2B and 14B scales. These capabilities enable more reliable synthetic data generation, policy evaluation, and closed-loop simulation for robotics and autonomous systems. We further extend the family with Cosmos-Transfer2.5, a control-net style framework for Sim2Real and Real2Real world translation. Despite being 3.5× smaller than Cosmos-Transfer1, it delivers higher fidelity and robust long-horizon video generation. Together, these advances establish Cosmos-Predict2.5 and Cosmos-Transfer2.5 as versatile tools for scaling embodied intelligence. To accelerate research and deployment in Physical AI, we release source code, pretrained checkpoints, and curated benchmarks under the NVIDIA Open Model License at https://github.com/nvidia-cosmos/cosmos-predict2.5 and https://github.com/nvidia-cosmos/cosmos-transfer2.5. We hope these open resources lower the barrier to adoption and foster innovation in building the next generation of embodied intelligence.
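
The two repositories above define the real interfaces. Purely as a picture of how a single model can serve Text2World, Image2World, and Video2World requests, here is a hypothetical dispatch sketch; `WorldRequest`, `generate_world`, and the dummy model are invented for illustration and are not the released API.

```python
# Hypothetical unified-inference sketch; consult the linked repos for the
# actual interface. All names here are illustrative assumptions.
from dataclasses import dataclass
from typing import Optional
import torch

@dataclass
class WorldRequest:
    prompt: str                               # Text2World conditioning
    image: Optional[torch.Tensor] = None      # Image2World: first frame (C, H, W)
    video: Optional[torch.Tensor] = None      # Video2World: context clip (T, C, H, W)
    num_frames: int = 121                     # horizon to roll out

class DummyWorldModel:
    """Stand-in that returns random frames; a real checkpoint goes here."""
    def __call__(self, prompt, context, num_frames):
        past = 0 if context is None else context.shape[0]
        return torch.randn(past + num_frames, 3, 64, 64)

def generate_world(model, request: WorldRequest) -> torch.Tensor:
    """Dispatch on whichever conditioning is provided; one model serves all modes."""
    if request.video is not None:
        context = request.video
    elif request.image is not None:
        context = request.image.unsqueeze(0)  # promote image to a 1-frame clip
    else:
        context = None                        # pure Text2World
    return model(prompt=request.prompt, context=context,
                 num_frames=request.num_frames)

frames = generate_world(DummyWorldModel(),
                        WorldRequest(prompt="a robot arm stacks red blocks"))
print(frames.shape)
```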
Problem

Research questions and friction points this paper is trying to address.

Lack of a single model that unifies text, image, and video conditioning for world simulation
Need for scalable synthetic data generation and policy evaluation for robotics (a minimal evaluation loop is sketched after this list)
Sim2Real and Real2Real gaps that limit training and deploying embodied intelligence
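
The policy-evaluation friction point reduces to rolling a candidate policy out inside the learned world model rather than the real environment. Below is a minimal sketch of that closed loop; the world model, policy, reward, and horizon are all toy stand-ins assumed for illustration.

```python
# Closed-loop policy evaluation in a learned world model: the model plays
# the environment, so a policy can be scored without real-world rollouts.
import torch

class ToyWorldModel:
    """Predicts the next observation from the current one and an action."""
    def step(self, obs: torch.Tensor, action: torch.Tensor) -> torch.Tensor:
        return obs + 0.1 * action + 0.01 * torch.randn_like(obs)

def toy_policy(obs: torch.Tensor) -> torch.Tensor:
    return -obs  # drive the observation toward zero

def evaluate_policy(world, policy, obs, horizon: int = 50) -> float:
    """Roll the policy out in imagination and return a mean reward."""
    total = 0.0
    for _ in range(horizon):
        action = policy(obs)
        obs = world.step(obs, action)
        total += -obs.pow(2).mean().item()  # toy reward: stay near the origin
    return total / horizon

score = evaluate_policy(ToyWorldModel(), toy_policy, torch.randn(8))
print(f"mean imagined reward: {score:.3f}")
```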
Innovation

Methods, ideas, or system contributions that make the work stand out.

Flow-based architecture unifies text-, image-, and video-conditioned generation
Reinforcement learning post-training improves video quality and instruction alignment
Control-net style framework enables robust Sim2Real video translation (see the conditioning sketch after this list)
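
A control-net style framework, in its usual form, attaches a trainable copy of backbone blocks to a frozen pretrained generator and injects a spatial control signal (for Sim2Real, e.g., a simulated depth or segmentation map) through a zero-initialized projection, so the control branch starts with no effect and is learned gradually. A minimal sketch under those standard assumptions, with illustrative sizes:

```python
# ControlNet-style conditioning sketch: frozen backbone + trainable copy
# whose output enters through a zero-initialized 1x1 conv.
import torch
import torch.nn as nn

class Block(nn.Module):
    """One backbone block of the (frozen) pretrained generator."""
    def __init__(self, ch: int = 64):
        super().__init__()
        self.conv = nn.Conv2d(ch, ch, 3, padding=1)

    def forward(self, x):
        return torch.relu(self.conv(x))

class ControlNetBlock(nn.Module):
    def __init__(self, backbone: Block, ch: int = 64, ctrl_ch: int = 1):
        super().__init__()
        self.frozen = backbone
        for p in self.frozen.parameters():
            p.requires_grad_(False)           # backbone stays fixed
        self.control_in = nn.Conv2d(ctrl_ch, ch, 1)
        self.copy = Block(ch)                 # trainable copy of the block
        self.zero_out = nn.Conv2d(ch, ch, 1)
        nn.init.zeros_(self.zero_out.weight)  # zero-init: no effect at step 0
        nn.init.zeros_(self.zero_out.bias)

    def forward(self, x, control):
        base = self.frozen(x)                            # frozen path
        ctrl = self.copy(x + self.control_in(control))   # trainable control path
        return base + self.zero_out(ctrl)                # additive injection

block = ControlNetBlock(Block())
feats = torch.randn(2, 64, 32, 32)   # video frame features
depth = torch.randn(2, 1, 32, 32)    # simulated depth map as the control signal
print(block(feats, depth).shape)     # torch.Size([2, 64, 32, 32])
```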