Kling-Foley: Multimodal Diffusion Transformer for High-Quality Video-to-Audio Generation

📅 2025-06-24
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses high-fidelity video-to-audio generation, aiming for joint semantic and temporal alignment between the generated audio and the input video. We propose a multimodal diffusion Transformer architecture that models interactions among video, audio, and text, combined with a visual semantic representation module and an audio-visual synchronization module that align video conditions with audio latents at the frame level. To cover sound effects, speech, singing, and music in a single model, we adopt a universal latent audio codec, a stereo rendering method, and a flow-matching training objective. We also release Kling-Audio-Eval, an industrial-level benchmark for video-to-audio generation, and report state-of-the-art performance among public models on four dimensions: distribution matching, semantic alignment, temporal alignment, and audio quality.

📝 Abstract
We propose Kling-Foley, a large-scale multimodal Video-to-Audio generation model that synthesizes high-quality audio synchronized with video content. In Kling-Foley, we introduce multimodal diffusion transformers to model the interactions between video, audio, and text modalities, and combine them with a visual semantic representation module and an audio-visual synchronization module to enhance alignment capabilities. Specifically, these modules align video conditions with latent audio elements at the frame level, thereby improving semantic alignment and audio-visual synchronization. Together with text conditions, this integrated approach enables precise generation of video-matching sound effects. In addition, we propose a universal latent audio codec that achieves high-quality modeling across scenarios such as sound effects, speech, singing, and music. We employ a stereo rendering method that gives synthesized audio a sense of spatial presence. To address the limited category coverage and annotations of existing open-source benchmarks, we also open-source an industrial-level benchmark, Kling-Audio-Eval. Our experiments show that Kling-Foley, trained with the flow matching objective, achieves new audio-visual SOTA performance among public models in terms of distribution matching, semantic alignment, temporal alignment, and audio quality.
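The abstract names a flow matching objective without giving its exact form. As a rough illustration only, the sketch below uses the common linear-path variant, where a noise sample is interpolated toward a clean latent and the model regresses the constant velocity between them. The function names, the `cond` argument, and the choice of a linear path are assumptions made for this example, not details confirmed by the paper.

```python
import random


def flow_matching_pair(x1, t, rng):
    """Build one training example on a linear noise-to-data path.

    x1  : list of floats standing in for a clean audio latent (toy data)
    t   : time in [0, 1]; t=0 is pure noise, t=1 is the clean latent
    rng : random.Random instance used to draw the noise sample
    Returns (x_t, target_velocity).
    """
    x0 = [rng.gauss(0.0, 1.0) for _ in x1]                  # Gaussian noise sample
    x_t = [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]   # point on the path
    target = [b - a for a, b in zip(x0, x1)]                # d(x_t)/dt on this path
    return x_t, target


def flow_matching_loss(predict_velocity, x1, cond, rng):
    """Mean squared error between predicted and target velocity.

    predict_velocity(x_t, t, cond) is a hypothetical model callable;
    cond stands in for whatever video/text conditioning the model uses.
    """
    t = rng.random()
    x_t, target = flow_matching_pair(x1, t, rng)
    pred = predict_velocity(x_t, t, cond)
    return sum((p - q) ** 2 for p, q in zip(pred, target)) / len(target)


# Usage with a placeholder "model" that always predicts zero velocity.
rng = random.Random(0)
latent = [0.5, -1.0, 2.0]
loss = flow_matching_loss(lambda x, t, c: [0.0] * len(x), latent, None, rng)
print(loss)
```

In a real system the placeholder callable would be the conditioned network, and the loss would be averaged over batches of latents and sampled times.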
Problem

Research questions and friction points this paper is trying to address.

Generating high-quality audio synchronized with video content
Improving semantic and temporal alignment between video and audio
Modeling diverse audio scenarios, including sound effects, speech, singing, and music
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multimodal diffusion transformers for video-audio-text interactions
Universal latent audio codec for diverse sound scenarios
Stereo rendering method for spatial audio presence
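The summary describes aligning video conditions with audio latents at the frame level but does not say how the two rates are matched. One simple way to picture it, offered here purely as an assumption, is to give each audio latent frame the feature of the video frame that is on screen at the same timestamp. The function name, the parameters, and the nearest-earlier-frame rule below are hypothetical and are not taken from the paper.

```python
def align_video_to_audio(video_feats, video_fps, latent_rate, num_latents):
    """Repeat or select video features so there is one per audio latent frame.

    video_feats : list of per-frame video features (any objects)
    video_fps   : video frames per second
    latent_rate : audio latent frames per second
    num_latents : number of audio latent frames to cover
    """
    if not video_feats:
        raise ValueError("video_feats must not be empty")
    aligned = []
    for j in range(num_latents):
        # Index of the video frame shown at the time of latent frame j,
        # clamped so latents past the last video frame reuse that frame.
        i = min(int(j * video_fps / latent_rate), len(video_feats) - 1)
        aligned.append(video_feats[i])
    return aligned


# Usage: 2 fps video, 4 latent frames per second, so each video frame
# covers two consecutive latent frames.
print(align_video_to_audio(["v0", "v1", "v2"], 2, 4, 6))
# ['v0', 'v0', 'v1', 'v1', 'v2', 'v2']
```

A learned synchronization module would likely do something richer than index selection; the point of the sketch is only that both streams end up on a shared time grid before they interact.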
Jun Wang
Kuaishou Technology, Beijing, China
Xijuan Zeng
Kuaishou Technology, Beijing, China
Chunyu Qiang
Kuaishou Technology; TJU; CASIA
Speech Synthesis
Ruilong Chen
Kuaishou Technology; NUDT; BUAA
Speech Processing, Computer Vision
Shiyao Wang
Kuaishou Technology, Beijing, China
Le Wang
Kuaishou Technology, Beijing, China
Wangjing Zhou
Kuaishou Technology, Beijing, China
Pengfei Cai
Kuaishou Technology, Beijing, China
Jiahui Zhao
Kuaishou Technology, Beijing, China
Nan Li
Kuaishou Technology, Beijing, China
Zihan Li
University of Washington
Foundation Model, AI for Healthcare, Multimodal Learning
Yuzhe Liang
Shanghai Jiao Tong University
Deep learning, Multimodal Learning
Xiaopeng Wang
Institute of Automation, Chinese Academy of Sciences
Fake Audio Detection, Text To Speech, Speech Large Model
Haorui Zheng
Kuaishou Technology, Beijing, China
Ming Wen
Kuaishou Technology, Beijing, China
Kang Yin
Kuaishou Technology, Beijing, China
Yiran Wang
Kuaishou Technology, Beijing, China
Nan Li
Kuaishou Technology, Beijing, China
Feng Deng
Kuaishou Technology, Beijing, China
Liang Dong
Kuaishou Technology, Beijing, China
Chen Zhang
Kuaishou Technology, Beijing, China
Di Zhang
Kuaishou Technology, Beijing, China
Kun Gai
Senior Director & Researcher, Alibaba Group
Machine Learning, Computational Advertising