Cosmos-Transfer1: Conditional World Generation with Adaptive Multimodal Control

📅 2025-03-18
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses Sim2Real transfer and data enrichment for Physical AI, including robotics and autonomous driving, by proposing a conditional world generation model driven by multiple spatial control inputs such as segmentation, depth, and edge maps. Its conditioning scheme is adaptive and customizable: different control inputs can be weighted differently at different spatial locations, which gives finer control than relying on a single modality or a single global weight. The authors evaluate the model extensively and demonstrate an inference scaling strategy that reaches real-time world generation on an NVIDIA GB200 NVL72 rack. The models and code are open-sourced.

📝 Abstract
We introduce Cosmos-Transfer, a conditional world generation model that can generate world simulations based on multiple spatial control inputs of various modalities such as segmentation, depth, and edge. In the design, the spatial conditional scheme is adaptive and customizable. It allows weighting different conditional inputs differently at different spatial locations. This enables highly controllable world generation and finds use in various world-to-world transfer use cases, including Sim2Real. We conduct extensive evaluations to analyze the proposed model and demonstrate its applications for Physical AI, including robotics Sim2Real and autonomous vehicle data enrichment. We further demonstrate an inference scaling strategy to achieve real-time world generation with an NVIDIA GB200 NVL72 rack. To help accelerate research development in the field, we open-source our models and code at https://github.com/nvidia-cosmos/cosmos-transfer1.
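The spatially adaptive weighting described in the abstract can be illustrated with a small sketch. This is not the released implementation: the function name `fuse_controls`, the use of scalar maps in place of real feature tensors, and the choice to normalize weights so they sum to 1 at each location are all assumptions made for illustration.

```python
from typing import Dict, List

Grid = List[List[float]]  # H x W map of scalar values


def fuse_controls(features: Dict[str, Grid], weights: Dict[str, Grid]) -> Grid:
    """Blend per-modality maps with per-location weights (hypothetical helper)."""
    names = list(features)
    height = len(features[names[0]])
    width = len(features[names[0]][0])
    fused = [[0.0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            total = sum(weights[n][y][x] for n in names)
            if total == 0.0:
                continue  # no modality is active at this location; keep 0.0
            for n in names:
                # Assumption: weights are normalized to sum to 1 per location.
                fused[y][x] += (weights[n][y][x] / total) * features[n][y][x]
    return fused


# Toy 1 x 2 example: depth drives the left pixel, edge drives the right pixel.
features = {"depth": [[2.0, 2.0]], "edge": [[4.0, 4.0]]}
weights = {"depth": [[1.0, 0.0]], "edge": [[0.0, 1.0]]}
fused = fuse_controls(features, weights)
```

In this toy case each pixel follows whichever modality has nonzero weight there. Setting both weights to 0.5 at a pixel would instead average the two inputs at that location.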
Problem

Research questions and friction points this paper is trying to address.

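How to generate world simulations conditioned on multiple spatial control inputs such as segmentation, depth, and edge.
How to make the conditioning adaptive and customizable, so that different inputs can be weighted differently at different spatial locations.
How to scale inference to real-time generation for Physical AI uses such as robotics Sim2Real and autonomous vehicle data enrichment.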
Innovation

Methods, ideas, or system contributions that make the work stand out.

Adaptive multimodal control for world generation
Real-time generation using NVIDIA GB200 NVL72
Open-source models for research acceleration
Hassan Abu Alhaija
NVIDIA
Jose Alvarez
NVIDIA
Maciej Bala
NVIDIA
Tiffany Cai
NVIDIA
Tianshi Cao
NVIDIA
Liz Cha
NVIDIA
Joshua Chen
NVIDIA
Mike Chen
NVIDIA
Francesco Ferroni
NVIDIA
machine learning, deep learning, robotics, computer vision, physics
Sanja Fidler
University of Toronto, NVIDIA
Computer Vision
Dieter Fox
University of Washington and AI2
Robotics, Artificial Intelligence, Computer Vision
Yunhao Ge
Research Scientist, NVIDIA
Deep Learning, Computer Vision, Generative AI, Robotics
Jinwei Gu
NVIDIA
Ali Hassani
NVIDIA
High-Performance AI, Computer Vision
Michael Isaev
NVIDIA
Pooya Jannaty
Brown University
Shiyi Lan
NVIDIA
Vision, LLM Agent, Visual Gen
Tobias Lasser
NVIDIA, Technische Universität München
Computational Imaging, Inverse Problems in Tomography, Medical Image Processing
Huan Ling
University of Toronto
computer vision
Ming-Yu Liu
NVIDIA
Xian Liu
NVIDIA
Yifan Lu
NVIDIA
Alice Luo
NVIDIA
Qianli Ma
NVIDIA
Hanzi Mao
Research Scientist, NVIDIA
Deep Learning, Computer Vision
Fabio Ramos
University of Sydney and NVIDIA
robotics, machine learning
Xuanchi Ren
University of Toronto, NVIDIA
Computer Vision, Machine Learning, Generative Model
Tianchang Shen
University of Toronto, NVIDIA
3D Vision, Deep Learning
Shitao Tang
NVIDIA
Ting-Chun Wang
NVIDIA Research
Computer vision, Computer graphics
Jay Wu
NVIDIA
Jiashu Xu
NVIDIA
Stella Xu
NVIDIA
Kevin Xie
University of Toronto
Yuchong Ye
NVIDIA
Xiaodong Yang
NVIDIA
Xiaohui Zeng
NVIDIA
Yu Zeng
NVIDIA