DreamID-Omni: Unified Framework for Controllable Human-Centric Audio-Video Generation

📅 2026-02-12
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
Existing approaches struggle to achieve precise, disentangled control of multiple identities and voice timbres within a unified framework. To address this challenge, this work proposes DreamID-Omni, a unified architecture that integrates multimodal conditions through a Symmetric Conditional Diffusion Transformer. The method employs a two-level disentanglement strategy: at the signal level, Synchronized Rotary Position Embedding (Synchronized RoPE) enforces attention-space binding across modalities, while at the semantic level, structured captions establish explicit attribute-to-subject mappings to mitigate identity–timbre confusion. Coupled with a multi-task progressive training scheme, the proposed approach achieves state-of-the-art performance in video quality, audio fidelity, and audio-visual consistency, outperforming leading commercial models.
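The synchronization idea behind Synchronized RoPE can be illustrated with a minimal sketch: if audio and video tokens that describe the same moment are assigned the same temporal position before rotary embedding, their relative rotary phase is zero, so cross-modal attention can bind co-temporal tokens rigidly. The token rates, dimensions, and helper functions below are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def rope_angles(positions, dim, base=10000.0):
    """Rotary angles for each position: shape (len(positions), dim // 2)."""
    inv_freq = 1.0 / (base ** (np.arange(0, dim, 2) / dim))
    return np.outer(positions, inv_freq)

def apply_rope(x, positions):
    """Rotate consecutive feature pairs of x (tokens, dim) by position-dependent angles."""
    ang = rope_angles(positions, x.shape[1])
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin
    out[:, 1::2] = x1 * sin + x2 * cos
    return out

# Hypothetical token rates: modalities tick at different frequencies,
# but positions are expressed in shared seconds, not per-modality indices.
video_fps, audio_tps = 8, 16
seconds, dim = 2, 32
video_pos = np.arange(seconds * video_fps) / video_fps   # timestamps in seconds
audio_pos = np.arange(seconds * audio_tps) / audio_tps

# Identical features at the same timestamp receive identical rotations,
# regardless of which modality stream they came from.
video_q = apply_rope(np.ones((len(video_pos), dim)), video_pos)
audio_k = apply_rope(np.ones((len(audio_pos), dim)), audio_pos)
```

Because both streams share one clock, the video token at t = 1.0 s (`video_q[8]`) and the audio token at t = 1.0 s (`audio_k[16]`) end up with matching rotary phases, which is the "rigid binding" property the summary describes.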

📝 Abstract
Recent advancements in foundation models have revolutionized joint audio-video generation. However, existing approaches typically treat human-centric tasks, including reference-based audio-video generation (R2AV), video editing (RV2AV), and audio-driven video animation (RA2V), as isolated objectives. Furthermore, achieving precise, disentangled control over multiple character identities and voice timbres within a single framework remains an open challenge. In this paper, we propose DreamID-Omni, a unified framework for controllable human-centric audio-video generation. Specifically, we design a Symmetric Conditional Diffusion Transformer that integrates heterogeneous conditioning signals via a symmetric conditional injection scheme. To resolve the pervasive identity-timbre binding failures and speaker confusion in multi-person scenarios, we introduce a Dual-Level Disentanglement strategy: Synchronized RoPE at the signal level to ensure rigid attention-space binding, and Structured Captions at the semantic level to establish explicit attribute-subject mappings. Furthermore, we devise a Multi-Task Progressive Training scheme that leverages weakly-constrained generative priors to regularize strongly-constrained tasks, preventing overfitting and harmonizing disparate objectives. Extensive experiments demonstrate that DreamID-Omni achieves comprehensive state-of-the-art performance across video, audio, and audio-visual consistency, even outperforming leading proprietary commercial models. We will release our code to bridge the gap between academic research and commercial-grade applications.
Problem

Research questions and friction points this paper is trying to address.

human-centric audio-video generation
identity-timbre disentanglement
multi-person scenarios
controllable generation
unified framework
Innovation

Methods, ideas, or system contributions that make the work stand out.

Symmetric Conditional Diffusion Transformer
Dual-Level Disentanglement
Synchronized RoPE
Structured Captions
Multi-Task Progressive Training
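To make the Structured Captions contribution concrete: the idea is that the caption explicitly names which identity reference and which voice timbre belong to which subject, instead of leaving the binding implicit in free-form text. The sketch below is a hypothetical schema for illustration only; the field names and file references are assumptions, not the paper's actual caption format.

```python
# Hypothetical structured caption with explicit attribute-to-subject
# bindings for a two-speaker scene (schema is illustrative).
structured_caption = {
    "scene": "two people talking in a kitchen",
    "subjects": [
        {
            "id": "person_1",
            "identity_ref": "ref_image_1.png",  # visual identity condition
            "timbre_ref": "voice_1.wav",        # voice timbre condition
            "action": "speaks first",
        },
        {
            "id": "person_2",
            "identity_ref": "ref_image_2.png",
            "timbre_ref": "voice_2.wav",
            "action": "replies",
        },
    ],
}
```

A free-form caption like "two people talking, one with a deep voice" leaves the model to guess which face gets which timbre; an explicit per-subject mapping like this removes that ambiguity, which is how semantic-level disentanglement mitigates speaker confusion.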