🤖 AI Summary
This work addresses high-fidelity, video-driven audio generation, aiming for joint semantic and temporal alignment between audio and video. We propose a multimodal diffusion Transformer architecture that integrates visual semantic representations with an audio-video synchronization module to model cross-modal interactions among video, audio, and text at the frame level. To enable unified generation across sound effects, speech, singing, and music, we adopt a general-purpose latent audio codec, stereo rendering, and a flow-matching training objective. Furthermore, we release Kling-Audio-Eval, a production-grade benchmark for audio-video generation, and achieve state-of-the-art performance on four key metrics (distribution matching, semantic alignment, temporal synchronization, and audio fidelity), demonstrating substantial improvements in audio-video co-generation over prior publicly available methods.
📝 Abstract
We propose Kling-Foley, a large-scale multimodal Video-to-Audio generation model that synthesizes high-quality audio synchronized with video content. In Kling-Foley, we introduce a multimodal diffusion transformer to model the interactions among video, audio, and text modalities, and combine it with a visual semantic representation module and an audio-visual synchronization module to enhance alignment. Specifically, these modules align video conditions with latent audio elements at the frame level, improving semantic alignment and audio-visual synchronization. Together with text conditions, this integrated approach enables precise generation of sound effects that match the video. In addition, we propose a universal latent audio codec that achieves high-quality modeling across diverse scenarios, including sound effects, speech, singing, and music, and we employ a stereo rendering method that gives the synthesized audio spatial presence. To compensate for the limited category coverage and annotations of existing open-source benchmarks, we also open-source an industrial-grade benchmark, Kling-Audio-Eval. Our experiments show that Kling-Foley, trained with a flow-matching objective, achieves new state-of-the-art audio-visual performance among public models in terms of distribution matching, semantic alignment, temporal alignment, and audio quality.
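The flow-matching objective mentioned above can be understood as velocity regression along a linear noise-to-data path. The following is a minimal, hypothetical NumPy sketch of that idea; the model, shapes, and lack of video/text conditioning are illustrative placeholders, not the paper's actual implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

def flow_matching_loss(model, x1, rng):
    """Conditional flow-matching loss with a linear interpolation path.

    x1: batch of clean latent audio vectors, shape (B, D).
    For the path x_t = (1 - t) * x0 + t * x1, the target velocity
    is simply x1 - x0, so training reduces to an MSE regression.
    """
    b, d = x1.shape
    x0 = rng.standard_normal((b, d))          # noise sample
    t = rng.uniform(size=(b, 1))              # per-example time in [0, 1]
    xt = (1.0 - t) * x0 + t * x1              # point on the probability path
    v_target = x1 - x0                        # ground-truth velocity field
    v_pred = model(xt, t)                     # model's predicted velocity
    return float(np.mean((v_pred - v_target) ** 2))

# Toy stand-in "model" that predicts zero velocity everywhere; the real
# model would be a diffusion transformer conditioned on video and text.
zero_model = lambda xt, t: np.zeros_like(xt)

x1 = rng.standard_normal((4, 8))
loss = flow_matching_loss(zero_model, x1, rng)
print(loss)
```

At inference, the learned velocity field is integrated from noise toward data with an ODE solver, which is what makes flow matching a few-step alternative to classic diffusion sampling.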