🤖 AI Summary
Current video generation methods lack synchronized high-fidelity audio, severely limiting immersion, while video-to-audio generation faces core challenges including multimodal data scarcity, modality imbalance, and insufficient audio quality. This paper proposes HunyuanVideo-Foley, an end-to-end text-video-to-audio generation framework. The authors introduce a 100k-hour-scale automatically constructed multimodal dataset and design a representation alignment strategy alongside a multimodal diffusion transformer architecture: self-supervised audio features guide latent diffusion training, while dual-stream audio-video joint attention and cross-attention-based textual semantic injection effectively mitigate modality competition. Experiments demonstrate that the method consistently outperforms state-of-the-art approaches across key metrics, including audio fidelity, visual-semantic alignment, temporal synchronization, and distribution matching, significantly improving both the quality and consistency of Foley sound synthesis.
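To make the representation alignment strategy concrete, here is a minimal PyTorch-style sketch of how intermediate diffusion features might be pulled toward frozen self-supervised audio embeddings via a cosine-similarity objective. This is an assumption about the general technique, not the paper's released code; names such as `RepresentationAlignmentLoss`, `hidden_dim`, and `ssl_dim` are hypothetical.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RepresentationAlignmentLoss(nn.Module):
    """Hypothetical REPA-style objective: align intermediate diffusion
    features with frozen self-supervised audio embeddings."""

    def __init__(self, hidden_dim: int, ssl_dim: int):
        super().__init__()
        # Small MLP projecting diffusion hidden states into the
        # self-supervised feature space.
        self.proj = nn.Sequential(
            nn.Linear(hidden_dim, ssl_dim),
            nn.SiLU(),
            nn.Linear(ssl_dim, ssl_dim),
        )

    def forward(self, diffusion_feats: torch.Tensor,
                ssl_feats: torch.Tensor) -> torch.Tensor:
        # diffusion_feats: (batch, time, hidden_dim) from an intermediate
        # transformer layer; ssl_feats: (batch, time, ssl_dim) from a frozen
        # pretrained self-supervised audio encoder.
        pred = self.proj(diffusion_feats)
        # Negative cosine similarity per frame, averaged over time and batch.
        return -F.cosine_similarity(pred, ssl_feats.detach(), dim=-1).mean()
```

During training such a term would typically be added to the standard diffusion loss with a weighting coefficient, e.g. `total = diffusion_loss + lambda_align * align_loss`, where `lambda_align` is a hypothetical hyperparameter.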
📝 Abstract
Recent advances in video generation produce visually realistic content, yet the absence of synchronized audio severely compromises immersion. To address key challenges in video-to-audio generation, including multimodal data scarcity, modality imbalance, and limited audio quality in existing methods, we propose HunyuanVideo-Foley, an end-to-end text-video-to-audio framework that synthesizes high-fidelity audio precisely aligned with visual dynamics and semantic context. Our approach incorporates three core innovations: (1) a scalable data pipeline curating a 100k-hour multimodal dataset through automated annotation; (2) a representation alignment strategy that uses self-supervised audio features to guide latent diffusion training, efficiently improving audio quality and generation stability; (3) a novel multimodal diffusion transformer that resolves modality competition by combining dual-stream audio-video fusion through joint attention with textual semantic injection via cross-attention. Comprehensive evaluations demonstrate that HunyuanVideo-Foley achieves new state-of-the-art performance across audio fidelity, visual-semantic alignment, temporal alignment, and distribution matching. The demo page is available at: https://szczesnys.github.io/hunyuanvideo-foley/.
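For intuition on innovation (3), the following is a hedged sketch of one possible dual-stream block: audio and video tokens keep separate projections but attend jointly over the concatenated token sequence, after which text semantics are injected into the audio stream via cross-attention. Class and parameter names (`DualStreamBlock`, `heads`, etc.) are illustrative assumptions, not the released architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DualStreamBlock(nn.Module):
    """Hypothetical multimodal DiT block: dual-stream joint attention over
    audio and video tokens, plus text injection via cross-attention."""

    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.heads = heads
        # Separate QKV/output projections per modality ("dual-stream").
        self.qkv_audio = nn.Linear(dim, dim * 3)
        self.qkv_video = nn.Linear(dim, dim * 3)
        self.out_audio = nn.Linear(dim, dim)
        self.out_video = nn.Linear(dim, dim)
        # Cross-attention that injects text tokens into the audio stream.
        self.text_xattn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm_a = nn.LayerNorm(dim)
        self.norm_v = nn.LayerNorm(dim)
        self.norm_xa = nn.LayerNorm(dim)
        self.norm_t = nn.LayerNorm(dim)

    def _heads(self, qkv: torch.Tensor, B: int, L: int):
        # Split fused QKV and reshape each to (B, heads, L, head_dim).
        q, k, v = qkv.chunk(3, dim=-1)
        return [t.view(B, L, self.heads, -1).transpose(1, 2) for t in (q, k, v)]

    def forward(self, audio, video, text):
        B, La, _ = audio.shape
        Lv = video.shape[1]
        qa, ka, va = self._heads(self.qkv_audio(self.norm_a(audio)), B, La)
        qv, kv, vv = self._heads(self.qkv_video(self.norm_v(video)), B, Lv)
        # Joint attention: concatenate along the sequence axis so every
        # audio token attends to video tokens and vice versa.
        out = F.scaled_dot_product_attention(
            torch.cat([qa, qv], dim=2),
            torch.cat([ka, kv], dim=2),
            torch.cat([va, vv], dim=2),
        ).transpose(1, 2).reshape(B, La + Lv, -1)
        audio = audio + self.out_audio(out[:, :La])
        video = video + self.out_video(out[:, La:])
        # Textual semantic injection: audio queries attend to text tokens.
        t = self.norm_t(text)
        audio = audio + self.text_xattn(self.norm_xa(audio), t, t,
                                        need_weights=False)[0]
        return audio, video
```

Keeping per-modality projections while sharing a single attention operation is one plausible way to let the streams interact without forcing them through identical weights, which is consistent with the paper's stated goal of mitigating modality competition.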