DreamVideo-Omni: Omni-Motion Controlled Multi-Subject Video Customization with Latent Identity Reinforcement Learning

📅 2026-03-12
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing video generation models struggle to simultaneously preserve multi-subject identity and provide precise control over motion at multiple granularities, often producing identity degradation or ambiguous control. DreamVideo-Omni addresses this with a unified framework built on a two-stage training paradigm. The first stage jointly integrates multidimensional control signals (subject appearance, global and local motion, and camera dynamics); the second introduces a reinforcement-learning-based latent identity reward, aligned with human preferences, to strengthen identity fidelity. Key innovations include group and role embeddings that disentangle motion ambiguity among multiple subjects, a condition-aware 3D rotary positional embedding, and a hierarchical motion injection strategy. Evaluated on a newly curated large-scale dataset and on DreamOmni Bench, the method outperforms current approaches in video quality, identity preservation, and motion-control accuracy.
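The latent identity reward described above can be pictured as a similarity score computed directly on denoised latents, which then reweights the training signal. A minimal sketch, assuming a cosine-similarity reward and a simple exponential reweighting; all function names and the reward form are illustrative stand-ins, not the paper's learned reward model:

```python
import numpy as np

def latent_identity_reward(denoised_latents, ref_identity, eps=1e-8):
    """Score how well denoised video latents preserve a reference identity.

    denoised_latents: (frames, dim) pooled latent features, one row per frame
    ref_identity:     (dim,) identity embedding of the reference subject
    Returns mean per-frame cosine similarity in [-1, 1].
    (Hypothetical stand-in for the paper's learned latent reward model.)
    """
    lat = denoised_latents / (np.linalg.norm(denoised_latents, axis=-1, keepdims=True) + eps)
    ref = ref_identity / (np.linalg.norm(ref_identity) + eps)
    return float(np.mean(lat @ ref))

def reward_weighted_loss(diffusion_loss, reward, beta=1.0):
    """Reward feedback: up-weight samples that preserve identity poorly and
    down-weight those that already score well (one simple reward-weighting
    scheme, assumed here for illustration)."""
    return diffusion_loss * np.exp(-beta * reward)
```

In an actual fine-tuning loop the reward would come from a learned model over the diffusion backbone's latents, but the reweighting pattern is the same: higher identity reward, smaller gradient pressure.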

📝 Abstract
While large-scale diffusion models have revolutionized video synthesis, achieving precise control over both multi-subject identity and multi-granularity motion remains a significant challenge. Recent attempts to bridge this gap often suffer from limited motion granularity, control ambiguity, and identity degradation, leading to suboptimal performance on identity preservation and motion control. In this work, we present DreamVideo-Omni, a unified framework enabling harmonious multi-subject customization with omni-motion control via a progressive two-stage training paradigm. In the first stage, we integrate comprehensive control signals for joint training, encompassing subject appearances, global motion, local dynamics, and camera movements. To ensure robust and precise controllability, we introduce a condition-aware 3D rotary positional embedding to coordinate heterogeneous inputs and a hierarchical motion injection strategy to enhance global motion guidance. Furthermore, to resolve multi-subject ambiguity, we introduce group and role embeddings to explicitly anchor motion signals to specific identities, effectively disentangling complex scenes into independent controllable instances. In the second stage, to mitigate identity degradation, we design a latent identity reward feedback learning paradigm by training a latent identity reward model upon a pretrained video diffusion backbone. This provides motion-aware identity rewards in the latent space, prioritizing identity preservation aligned with human preferences. Supported by our curated large-scale dataset and the comprehensive DreamOmni Bench for multi-subject and omni-motion control evaluation, DreamVideo-Omni demonstrates superior performance in generating high-quality videos with precise controllability.
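The condition-aware 3D rotary positional embedding mentioned in the abstract can be sketched as standard RoPE applied independently along the time, height, and width axes, with condition tokens given offset coordinates so they do not collide with video tokens. A minimal sketch under those assumptions; the band split and the `cond_shift` mechanism are one plausible reading, not the paper's specification:

```python
import numpy as np

def rope_1d(x, pos, base=10000.0):
    """Rotate channel pairs of x by angles proportional to token positions.
    x: (n, d) with d even; pos: (n,) positions along one axis."""
    d = x.shape[-1]
    inv_freq = base ** (-np.arange(0, d, 2) / d)   # (d/2,) frequency bands
    ang = pos[:, None] * inv_freq[None, :]         # (n, d/2) rotation angles
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin
    out[:, 1::2] = x1 * sin + x2 * cos
    return out

def condition_aware_rope_3d(x, t, h, w, cond_shift=0.0):
    """Split channels into three equal bands and rotate each band by one
    axis (time, height, width). Condition tokens would be passed with a
    nonzero cond_shift so their coordinates stay disjoint from video
    tokens -- an assumed reading of "condition-aware".
    x: (n, d) with d divisible by 6; t, h, w: (n,) float coordinates."""
    d3 = x.shape[-1] // 3
    bands = [x[:, :d3], x[:, d3:2 * d3], x[:, 2 * d3:]]
    coords = [t + cond_shift, h + cond_shift, w + cond_shift]
    return np.concatenate([rope_1d(b, p) for b, p in zip(bands, coords)], axis=-1)
```

Because each band is a pure rotation, per-token feature norms are preserved, so the embedding changes attention geometry without rescaling heterogeneous condition inputs.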
Problem

Research questions and friction points this paper is trying to address.

multi-subject identity
omni-motion control
video synthesis
identity preservation
motion granularity
Innovation

Methods, ideas, or system contributions that make the work stand out.

omni-motion control
multi-subject customization
latent identity reinforcement learning
hierarchical motion injection
condition-aware 3D rotary positional embedding