🤖 AI Summary
This work addresses the Sim2Real transfer and data-augmentation challenges in physical AI and autonomous driving by proposing a world-simulation generation method conditioned on multimodal spatial cues: semantic segmentation, depth, and edges. Methodologically, it introduces a spatially adaptive multimodal conditioning mechanism: a shared multimodal encoder jointly models the heterogeneous control signals, while a learnable spatial weight prediction module provides position-aware, fine-grained modality weighting, overcoming the limitations of single-modality or globally weighted conditioning. The framework pairs a diffusion-based generative architecture with an inference scaling strategy optimized for the NVIDIA GB200 NVL72. Experiments show substantial improvements in cross-domain generalization, and the system achieves real-time world synthesis on GB200 hardware. To foster reproducibility and community progress, the model and source code are publicly released.
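To make the spatially adaptive conditioning concrete, the sketch below shows one plausible reading of the mechanism: each modality branch (segmentation, depth, edge) produces a control feature map, a per-pixel weight map turns the logits into a softmax over modalities, and the fused control signal is the position-wise weighted sum. All names (`fuse_controls`, `weight_logits`) are illustrative assumptions, not the released API; in the actual model the weights would come from the learned spatial weight prediction module.

```python
import numpy as np

def softmax(x, axis=0):
    """Numerically stable softmax along the given axis."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def fuse_controls(control_feats, weight_logits):
    """Spatially adaptive fusion of per-modality control features.

    control_feats: (M, C, H, W) features from M control branches
                   (e.g. segmentation, depth, edge).
    weight_logits: (M, H, W) per-pixel modality logits. Here they are
                   given directly; in the real model a learned module
                   would predict them from the inputs.
    Returns a (C, H, W) fused control signal: at every spatial location
    the M modalities are mixed with weights that sum to one, so each
    pixel can favor a different control input.
    """
    w = softmax(weight_logits, axis=0)            # (M, H, W), sums to 1 over M
    return (control_feats * w[:, None]).sum(axis=0)
```

Pushing one modality's logit high at a given location makes the fused signal there equal that modality's features, which is how a single model can, say, trust depth on road geometry while trusting edges on thin structures.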
📝 Abstract
We introduce Cosmos-Transfer, a conditional world generation model that can generate world simulations based on multiple spatial control inputs of various modalities, such as segmentation, depth, and edges. The spatial conditioning scheme is adaptive and customizable: it allows weighting different conditional inputs differently at different spatial locations. This enables highly controllable world generation and finds use in various world-to-world transfer use cases, including Sim2Real. We conduct extensive evaluations to analyze the proposed model and demonstrate its applications for Physical AI, including robotics Sim2Real and autonomous vehicle data enrichment. We further demonstrate an inference scaling strategy to achieve real-time world generation with an NVIDIA GB200 NVL72 rack. To help accelerate research development in the field, we open-source our models and code at https://github.com/nvidia-cosmos/cosmos-transfer1.