HunyuanVideo-Foley: Multimodal Diffusion with Representation Alignment for High-Fidelity Foley Audio Generation

📅 2025-08-23
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current video generation methods lack synchronized, high-fidelity audio, which severely limits immersion; meanwhile, video-to-audio generation faces core challenges including multimodal data scarcity, modality imbalance, and insufficient audio quality. This paper proposes HunyuanVideo-Foley, an end-to-end framework in which text and video jointly drive audio generation. The authors construct a 100k-hour-scale multimodal dataset through automated annotation and design a representation alignment strategy alongside a multimodal diffusion Transformer architecture. Leveraging self-supervised audio features to guide latent diffusion training, the model combines dual-stream audio-video joint attention with cross-attention-based textual semantic injection to mitigate modality competition. Experiments show that the method consistently outperforms state-of-the-art approaches across key metrics, including audio fidelity, visual-semantic alignment, temporal synchronization, and distribution matching, improving both the quality and the consistency of Foley sound synthesis.

📝 Abstract
Recent advances in video generation produce visually realistic content, yet the absence of synchronized audio severely compromises immersion. To address key challenges in video-to-audio generation, including multimodal data scarcity, modality imbalance and limited audio quality in existing methods, we propose HunyuanVideo-Foley, an end-to-end text-video-to-audio framework that synthesizes high-fidelity audio precisely aligned with visual dynamics and semantic context. Our approach incorporates three core innovations: (1) a scalable data pipeline curating 100k-hour multimodal datasets through automated annotation; (2) a representation alignment strategy using self-supervised audio features to guide latent diffusion training, efficiently improving audio quality and generation stability; (3) a novel multimodal diffusion transformer resolving modal competition, containing dual-stream audio-video fusion through joint attention, and textual semantic injection via cross-attention. Comprehensive evaluations demonstrate that HunyuanVideo-Foley achieves new state-of-the-art performance across audio fidelity, visual-semantic alignment, temporal alignment and distribution matching. The demo page is available at: https://szczesnys.github.io/hunyuanvideo-foley/.
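Innovation (2), the representation alignment strategy, can be sketched roughly as follows. This is a hypothetical REPA-style illustration, not the paper's released code: the projector architecture, the dimensions, and the negative-cosine form of the loss are all assumptions; the idea shown is simply pulling an intermediate diffusion hidden state toward frozen self-supervised audio features during training.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RepresentationAlignmentLoss(nn.Module):
    """Hypothetical sketch of the alignment objective: project an
    intermediate hidden state of the diffusion transformer and pull it
    toward frozen self-supervised audio features via negative cosine
    similarity. Names and dimensions are assumptions."""

    def __init__(self, hidden_dim: int, ssl_dim: int):
        super().__init__()
        # small MLP mapping diffusion hidden states into the SSL feature space
        self.proj = nn.Sequential(
            nn.Linear(hidden_dim, ssl_dim),
            nn.SiLU(),
            nn.Linear(ssl_dim, ssl_dim),
        )

    def forward(self, hidden: torch.Tensor, ssl_feat: torch.Tensor) -> torch.Tensor:
        # hidden:   (B, T, hidden_dim) from an intermediate transformer block
        # ssl_feat: (B, T, ssl_dim) from a frozen self-supervised audio encoder
        pred = F.normalize(self.proj(hidden), dim=-1)
        target = F.normalize(ssl_feat.detach(), dim=-1)  # no gradient to the SSL encoder
        # maximize per-token cosine similarity, i.e. minimize its negation
        return -(pred * target).sum(dim=-1).mean()

loss_fn = RepresentationAlignmentLoss(hidden_dim=768, ssl_dim=1024)
hidden = torch.randn(2, 50, 768)
ssl_feat = torch.randn(2, 50, 1024)
loss = loss_fn(hidden, ssl_feat)
print(loss.shape)  # torch.Size([])
```

In this reading, the auxiliary loss is added to the usual latent-diffusion denoising objective, so the SSL features act as guidance rather than as a generation target.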
Problem

Research questions and friction points this paper is trying to address.

Generating synchronized audio for video content
Addressing multimodal data scarcity and imbalance
Improving audio fidelity and alignment quality
Innovation

Methods, ideas, or system contributions that make the work stand out.

Automated annotation for scalable multimodal datasets
Self-supervised audio features guiding latent diffusion
Multimodal diffusion transformer with dual-stream fusion
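The third bullet, the dual-stream multimodal transformer, can be illustrated with a minimal sketch. Everything below is an assumption about the general shape of such a block, not the paper's architecture: audio and video tokens get modality-specific normalization, attend jointly over the concatenated sequence, and text semantics are injected into the audio stream via cross-attention.

```python
import torch
import torch.nn as nn

class MMDiTBlock(nn.Module):
    """Hypothetical sketch of one dual-stream block: audio and video
    tokens keep separate norms but share joint self-attention over the
    concatenated sequence; text conditioning enters via cross-attention.
    Layer layout and shapes are assumptions, not the paper's code."""

    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.norm_a = nn.LayerNorm(dim)
        self.norm_v = nn.LayerNorm(dim)
        self.joint_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm_t = nn.LayerNorm(dim)
        self.text_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm_m = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, audio, video, text):
        # modality-specific norms, then joint self-attention over [audio; video]
        x = torch.cat([self.norm_a(audio), self.norm_v(video)], dim=1)
        joint, _ = self.joint_attn(x, x, x)
        n_a = audio.shape[1]
        audio = audio + joint[:, :n_a]
        video = video + joint[:, n_a:]
        # inject textual semantics into the audio stream via cross-attention
        txt, _ = self.text_attn(self.norm_t(audio), text, text)
        audio = audio + txt
        audio = audio + self.mlp(self.norm_m(audio))
        return audio, video

blk = MMDiTBlock(dim=64, heads=4)
audio_out, video_out = blk(
    torch.randn(2, 50, 64),   # audio latent tokens
    torch.randn(2, 30, 64),   # video feature tokens
    torch.randn(2, 10, 64),   # text embedding tokens
)
print(audio_out.shape, video_out.shape)  # torch.Size([2, 50, 64]) torch.Size([2, 30, 64])
```

Keeping per-modality parameters while sharing one attention map is one plausible way to realize the paper's stated goal of letting modalities interact without one dominating the other.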
👥 Authors
Sizhe Shan (Tencent Hunyuan)
Qiulin Li (Tencent Hunyuan)
Yutao Cui (Tencent Hunyuan)
Miles Yang (Tencent Hunyuan)
Yuehai Wang (Zhejiang University)
Qun Yang (Nanjing University of Aeronautics and Astronautics)
Jin Zhou (Tencent Hunyuan)
Zhao Zhong (Tencent Hunyuan)