UniAVGen: Unified Audio and Video Generation with Asymmetric Cross-Modal Interactions

📅 2025-11-05
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing open-source audio-visual generation methods exhibit weak cross-modal modeling, resulting in lip-audio asynchrony and semantic inconsistency. To address this, the authors propose UniAVGen, the first unified audio-visual generation framework based on a dual-branch diffusion Transformer. Its core contributions are: (1) an asymmetric cross-modal interaction mechanism enabling spatiotemporal audio-video alignment; (2) a face-aware modulation module that strengthens dynamic lip-motion modeling; and (3) a modality-aware classifier-free guidance strategy that improves multi-task generalization. The framework combines bidirectional, temporally aligned cross-attention with a dynamic region-weighting mechanism. Extensive experiments demonstrate that, trained on only 1.3M samples, UniAVGen surpasses baselines trained on far larger datasets (30.1M samples) in lip-audio synchronization accuracy, timbre fidelity, and emotional coherence, achieving state-of-the-art performance with significantly reduced data requirements.
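The "bidirectional, temporally aligned cross-attention" mentioned above can be read as two cross-attention passes over temporally resampled token streams: each modality attends to the other after the partner's tokens are resampled to its own temporal rate. The PyTorch sketch below is a minimal illustration under that reading; the module names, the nearest-neighbor resampling, and the residual wiring are assumptions, not the paper's released implementation.

```python
# Minimal sketch of bidirectional, temporally aligned cross-attention between
# a video token stream and an audio token stream. All names are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F

class BidirectionalCrossAttention(nn.Module):
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.v_from_a = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.a_from_v = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    @staticmethod
    def _align(tokens: torch.Tensor, target_len: int) -> torch.Tensor:
        # Resample one modality's tokens to the other's temporal resolution
        # (nearest-neighbor along time) so attention sees temporally
        # corresponding positions.
        x = tokens.transpose(1, 2)                    # (B, D, T)
        x = F.interpolate(x, size=target_len, mode="nearest")
        return x.transpose(1, 2)                      # (B, target_len, D)

    def forward(self, video: torch.Tensor, audio: torch.Tensor):
        # video: (B, Tv, D), audio: (B, Ta, D)
        a_on_v = self._align(audio, video.shape[1])   # audio at the video rate
        v_on_a = self._align(video, audio.shape[1])   # video at the audio rate
        video_out, _ = self.v_from_a(video, a_on_v, a_on_v)
        audio_out, _ = self.a_from_v(audio, v_on_a, v_on_a)
        return video + video_out, audio + audio_out   # residual updates
```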

📝 Abstract
Due to the lack of effective cross-modal modeling, existing open-source audio-video generation methods often exhibit compromised lip synchronization and insufficient semantic consistency. To mitigate these drawbacks, we propose UniAVGen, a unified framework for joint audio and video generation. UniAVGen is anchored in a dual-branch joint synthesis architecture, incorporating two parallel Diffusion Transformers (DiTs) to build a cohesive cross-modal latent space. At its heart lies an Asymmetric Cross-Modal Interaction mechanism, which enables bidirectional, temporally aligned cross-attention, thus ensuring precise spatiotemporal synchronization and semantic consistency. Furthermore, this cross-modal interaction is augmented by a Face-Aware Modulation module, which dynamically prioritizes salient regions in the interaction process. To enhance generative fidelity during inference, we additionally introduce Modality-Aware Classifier-Free Guidance, a novel strategy that explicitly amplifies cross-modal correlation signals. Notably, UniAVGen's robust joint synthesis design enables seamless unification of pivotal audio-video tasks within a single model, such as joint audio-video generation and continuation, video-to-audio dubbing, and audio-driven video synthesis. Comprehensive experiments validate that, with far fewer training samples (1.3M vs. 30.1M), UniAVGen delivers overall advantages in audio-video synchronization, timbre consistency, and emotion consistency.
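The abstract's Face-Aware Modulation module "dynamically prioritizes salient regions in the interaction process". One plausible reading is a per-token gain applied to face-region video tokens before the cross-modal attention. The sketch below illustrates that reading only; the sigmoid gating form, the external face mask, and all names are hypothetical, not the paper's formulation.

```python
# Illustrative sketch of face-aware modulation as dynamic region weighting:
# video tokens inside a detected face region receive a learned gain before
# cross-modal interaction. The gating form is an assumption.
import torch
import torch.nn as nn

class FaceAwareModulation(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(dim, dim), nn.Sigmoid())

    def forward(self, video_tokens: torch.Tensor,
                face_mask: torch.Tensor) -> torch.Tensor:
        # video_tokens: (B, T, D); face_mask: (B, T), 1.0 for face tokens
        g = self.gate(video_tokens)            # per-token gain in (0, 1)
        w = 1.0 + face_mask.unsqueeze(-1) * g  # amplify face regions only
        return video_tokens * w
```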
Problem

Research questions and friction points this paper is trying to address.

Improving lip synchronization in audio-video generation
Enhancing semantic consistency across audio and video modalities
Unifying multiple audio-video tasks within a single model
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dual-branch joint synthesis with Diffusion Transformers
Asymmetric Cross-Modal Interaction for spatiotemporal synchronization
Modality-Aware Classifier-Free Guidance for cross-modal correlation (see the sketch after this list)
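Standard classifier-free guidance extrapolates from an unconditional to a conditional denoiser prediction; a "modality-aware" variant that "explicitly amplifies cross-modal correlation signals" would plausibly separate the cross-modal term and scale it independently, in the spirit of compositional CFG. The decomposition and scale values below are assumptions for illustration, not the paper's exact formulation.

```python
# Hedged sketch of one modality-aware classifier-free guidance step, combining
# three denoiser passes with independent guidance scales.
import torch

def modality_aware_cfg(eps_full: torch.Tensor,
                       eps_no_cross: torch.Tensor,
                       eps_uncond: torch.Tensor,
                       s_text: float = 5.0,
                       s_cross: float = 2.0) -> torch.Tensor:
    """eps_full:     prediction conditioned on text + the other modality
    eps_no_cross: prediction conditioned on text only (cross-modal branch dropped)
    eps_uncond:   fully unconditional prediction
    """
    text_term = eps_no_cross - eps_uncond   # text guidance direction
    cross_term = eps_full - eps_no_cross    # cross-modal guidance direction
    return eps_uncond + s_text * text_term + s_cross * cross_term
```

With s_text = s_cross = 1 this reduces to the plain conditional prediction; raising s_cross above 1 is what would amplify the cross-modal correlation signal.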
Guozhen Zhang
Nanjing University
Video Frame Interpolation
Zixiang Zhou
Tencent Hunyuan
Teng Hu
Shanghai Jiao Tong University
Ziqiao Peng
Renmin University of China
3D Face Animation · Talking Head Generation
Youliang Zhang
Tsinghua University
Yi Chen
Tencent Hunyuan
Yuan Zhou
Tencent Hunyuan
Qinglin Lu
Tencent Hunyuan
Limin Wang
State Key Laboratory for Novel Software Technology, Nanjing University; Shanghai AI Lab