🤖 AI Summary
Current video generation models are constrained by unimodal conditioning inputs, which results in weak cross-modal interaction, insufficient modality diversity, and consequently limited world-modeling capability and poor physical consistency. To address these limitations, we propose UnityVideo, a multimodal world-aware video generation framework. Our method introduces a dynamic noising mechanism and a context-aware modality switcher to unify training across heterogeneous inputs, including segmentation masks, skeletal poses, DensePose, optical flow, and depth maps. We curate a large-scale multimodal video dataset comprising 1.3 million samples. Built on diffusion models, our architecture integrates modular parameterization, multi-task loss optimization, and in-context learning. Experiments demonstrate substantial improvements over state-of-the-art methods across multiple benchmarks in generation quality, spatiotemporal coherence, and zero-shot transferability, with outputs exhibiting stronger adherence to real-world physical principles.
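To make the dynamic noising idea concrete, below is a minimal PyTorch sketch under one plausible reading: the target video and the conditioning stream receive independently sampled diffusion timesteps, so a single objective spans both generation-style training (nearly clean condition) and estimation-style training (nearly clean video). Everything here (`DDPMNoiser`, `dynamic_noising_loss`, the model signature) is illustrative, not the paper's actual API.

```python
import torch
import torch.nn.functional as F


class DDPMNoiser:
    """Standard DDPM forward process q(x_t | x_0) with a linear beta schedule."""

    def __init__(self, num_steps: int = 1000):
        self.num_steps = num_steps
        betas = torch.linspace(1e-4, 0.02, num_steps)
        self.alpha_bar = torch.cumprod(1.0 - betas, dim=0)

    def add_noise(self, x0: torch.Tensor, t: torch.Tensor):
        """Return (noisy sample, noise) at per-example timesteps t."""
        eps = torch.randn_like(x0)
        a = self.alpha_bar[t].view(-1, *([1] * (x0.dim() - 1)))
        return a.sqrt() * x0 + (1.0 - a).sqrt() * eps, eps


def dynamic_noising_loss(model, video, cond, modality_id, noiser):
    """One training step. Timesteps are drawn independently per stream
    (assumption, for illustration): a near-zero t_cond approximates
    ordinary conditional generation, while a near-zero t_video
    approximates condition estimation, so one model covers both
    paradigms within a single diffusion objective."""
    b = video.shape[0]
    t_video = torch.randint(0, noiser.num_steps, (b,))
    t_cond = torch.randint(0, noiser.num_steps, (b,))
    noisy_video, eps = noiser.add_noise(video, t_video)
    noisy_cond, _ = noiser.add_noise(cond, t_cond)
    pred = model(noisy_video, noisy_cond, t_video, t_cond, modality_id)
    return F.mse_loss(pred, eps)
```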
📝 Abstract
Recent video generation models demonstrate impressive synthesis capabilities but remain limited by single-modality conditioning, constraining their holistic world understanding. This stems from insufficient cross-modal interaction and limited modality diversity for comprehensive world-knowledge representation. To address these limitations, we introduce UnityVideo, a unified framework for world-aware video generation that jointly learns across multiple modalities (segmentation masks, human skeletons, DensePose, optical flow, and depth maps) and training paradigms. Our approach features two core components: (1) dynamic noising to unify heterogeneous training paradigms, and (2) a modality switcher with an in-context learner that enables unified processing via modular parameters and contextual learning. We contribute a large-scale unified dataset with 1.3M samples. Through joint optimization, UnityVideo accelerates convergence and significantly enhances zero-shot generalization to unseen data. We demonstrate that UnityVideo achieves superior video quality, consistency, and improved alignment with physical-world constraints. Code and data can be found at: https://github.com/dvlab-research/UnityVideo
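For the second component, here is a hypothetical sketch of a modality switcher paired with an in-context learner: an integer modality ID selects modality-specific adapter weights (the modular parameters), and learnable per-modality context tokens are prepended to the condition sequence so a shared diffusion backbone can adapt its behavior to the active modality. All names, shapes, and the token-prepending design are assumptions for illustration, not the released implementation.

```python
import torch
import torch.nn as nn

MODALITIES = ["seg_mask", "skeleton", "densepose", "optical_flow", "depth"]


class ModalitySwitcher(nn.Module):
    """Route condition tokens through per-modality adapters and prepend
    learnable context tokens (hypothetical sketch)."""

    def __init__(self, dim: int = 768, num_context_tokens: int = 8):
        super().__init__()
        n = len(MODALITIES)
        # Modular parameters: one lightweight projection per modality.
        self.adapters = nn.ModuleList(nn.Linear(dim, dim) for _ in range(n))
        # In-context learner: learnable per-modality context tokens.
        self.context = nn.Parameter(
            torch.randn(n, num_context_tokens, dim) * 0.02
        )

    def forward(self, cond_tokens: torch.Tensor, modality_id: int) -> torch.Tensor:
        # cond_tokens: (batch, seq, dim) tokens of the selected modality.
        projected = self.adapters[modality_id](cond_tokens)
        ctx = self.context[modality_id].expand(cond_tokens.shape[0], -1, -1)
        # Prepend context tokens; the backbone attends to both.
        return torch.cat([ctx, projected], dim=1)


# Usage: route a batch of depth-map tokens through the switcher.
switcher = ModalitySwitcher()
depth_tokens = torch.randn(2, 64, 768)
out = switcher(depth_tokens, MODALITIES.index("depth"))  # (2, 72, 768)
```

One appeal of this routing design is that the shared backbone stays untouched when a modality is added or swapped, which is one plausible way a single network could process inputs as heterogeneous as depth maps and human skeletons.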