Harmony: Harmonizing Audio and Video Generation through Cross-Task Synergy

📅 2025-11-26
📈 Citations: 0
✨ Influential: 0
📄 PDF
🤖 AI Summary
Open-source audio-visual generation models suffer from unstable cross-modal synchronization, primarily due to three deficiencies in diffusion-based generation: (1) cross-modal correspondence drift; (2) global attention’s inability to capture fine-grained temporal alignment; and (3) intra-modal bias in conventional Classifier-Free Guidance (CFG), which undermines cross-modal synergy. To address these issues, we propose a unified framework comprising: (i) a cross-task co-training paradigm that jointly optimizes bidirectional audio-to-video and video-to-audio generation; (ii) a global-local decoupled interaction module that separately models long-range dependencies and frame-level synchronization; and (iii) a synchronization-enhanced CFG incorporating cross-modal consistency constraints to mitigate modality-specific biases. Our method achieves significant improvements in fine-grained audio-visual synchronization accuracy and generation fidelity across multiple benchmarks, establishing new state-of-the-art performance.

πŸ“ Abstract
The synthesis of synchronized audio-visual content is a key challenge in generative AI, and open-source models in particular struggle to achieve robust audio-video alignment. Our analysis reveals that this issue is rooted in three fundamental challenges of the joint diffusion process: (1) Correspondence Drift, where concurrently evolving noisy latents impede stable learning of alignment; (2) inefficient global attention mechanisms that fail to capture fine-grained temporal cues; and (3) the intra-modal bias of conventional Classifier-Free Guidance (CFG), which enhances conditionality but not cross-modal synchronization. To overcome these challenges, we introduce Harmony, a novel framework that mechanistically enforces audio-visual synchronization. We first propose a Cross-Task Synergy training paradigm to mitigate drift by leveraging strong supervisory signals from audio-driven video and video-driven audio generation tasks. Then, we design a Global-Local Decoupled Interaction Module for efficient and precise temporal-style alignment. Finally, we present a novel Synchronization-Enhanced CFG (SyncCFG) that explicitly isolates and amplifies the alignment signal during inference. Extensive experiments demonstrate that Harmony establishes a new state-of-the-art, significantly outperforming existing methods in both generation fidelity and, critically, in achieving fine-grained audio-visual synchronization.
Problem

Research questions and friction points this paper is trying to address.

Addressing audio-video alignment challenges in joint diffusion processes
Overcoming inefficient attention mechanisms for temporal synchronization
Solving intra-modal bias in classifier-free guidance for cross-modal sync
Innovation

Methods, ideas, or system contributions that make the work stand out.

Cross-Task Synergy training mitigates correspondence drift
Global-Local Decoupled Interaction enables precise temporal alignment
Synchronization-Enhanced CFG amplifies cross-modal alignment signals
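The last bullet concerns a modification of classifier-free guidance. The paper's exact SyncCFG formulation is not given on this page, but the general idea can be sketched: standard CFG steers the denoiser toward the conditional prediction, and a synchronization-enhanced variant could add a separate guidance term built from a cross-modal condition (e.g. the other modality's latent). The function names, the `eps_cross` input, and the two-weight decomposition below are illustrative assumptions, not the paper's actual method.

```python
import numpy as np

def cfg(eps_uncond, eps_cond, w):
    """Standard classifier-free guidance.

    Pushes the predicted noise from the unconditional estimate
    toward the conditional one with guidance weight w.
    """
    return eps_uncond + w * (eps_cond - eps_uncond)

def sync_cfg(eps_uncond, eps_cond, eps_cross, w_cond, w_sync):
    """Hypothetical synchronization-enhanced CFG sketch.

    eps_cross stands in for a prediction conditioned on the other
    modality (audio for the video branch, and vice versa), so the
    cross-modal alignment signal gets its own guidance weight
    instead of being folded into a single intra-modal term.
    """
    return (eps_uncond
            + w_cond * (eps_cond - eps_uncond)
            + w_sync * (eps_cross - eps_uncond))
```

Separating the two difference terms is what lets the alignment signal be isolated and amplified independently of ordinary text/condition guidance, which is the stated motivation for SyncCFG.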
🔎 Similar Papers
No similar papers found.
Teng Hu
Shanghai Jiao Tong University
Zhentao Yu
Researcher, Tencent Hunyuan
Computer Vision
Guozhen Zhang
Nanjing University
Video Frame Interpolation
Zihan Su
Shanghai Jiao Tong University
Zhengguang Zhou
Tencent Hunyuan
Youliang Zhang
Tencent Hunyuan
Yuan Zhou
Tencent Hunyuan
Qinglin Lu
Tencent Hunyuan
Ran Yi
Associate Professor, Shanghai Jiao Tong University
Computer Vision, Computer Graphics