🤖 AI Summary
This work addresses high-fidelity, video-driven audio generation, aiming for joint semantic and temporal alignment between audio and video. We propose a multimodal diffusion Transformer architecture that integrates visual semantic representations with an audio-video synchronization module to model cross-modal interactions among video, audio, and text at the frame level. To enable unified generation across sound effects, speech, singing, and music, we adopt a general-purpose latent audio codec, stereo rendering, and a flow-matching training objective. Furthermore, we release Kling-Audio-Eval, a production-grade benchmark for audio-video generation, and achieve state-of-the-art performance on four key metrics (distribution matching, semantic alignment, temporal synchronization, and audio fidelity), demonstrating substantial improvements in audio-video co-generation over prior publicly available methods.
📝 Abstract
We propose Kling-Foley, a large-scale multimodal Video-to-Audio generation model that synthesizes high-quality audio synchronized with video content. In Kling-Foley, we introduce a multimodal diffusion transformer to model the interactions among video, audio, and text modalities, and combine it with a visual semantic representation module and an audio-visual synchronization module to enhance alignment. Specifically, these modules align video conditions with latent audio elements at the frame level, improving semantic alignment and audio-visual synchronization. Together with text conditions, this integrated approach enables precise generation of sound effects that match the video. In addition, we propose a universal latent audio codec that achieves high-quality modeling across diverse scenarios, including sound effects, speech, singing, and music, and we employ a stereo rendering method that gives the synthesized audio spatial presence. To compensate for the limited category coverage and annotations of existing open-source benchmarks, we also open-source an industrial-grade benchmark, Kling-Audio-Eval. Our experiments show that Kling-Foley, trained with a flow-matching objective, achieves new state-of-the-art audio-visual performance among public models in terms of distribution matching, semantic alignment, temporal alignment, and audio quality.
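The flow-matching objective mentioned above can be understood as velocity regression along a linear noise-to-data path. The following is a minimal, hypothetical NumPy sketch of that idea; the model, shapes, and lack of video/text conditioning are illustrative placeholders, not the paper's actual implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

def flow_matching_loss(model, x1, rng):
    """Conditional flow-matching loss with a linear interpolation path.

    x1: batch of clean latent audio vectors, shape (B, D).
    For the path x_t = (1 - t) * x0 + t * x1, the target velocity
    is simply x1 - x0, so training reduces to an MSE regression.
    """
    b, d = x1.shape
    x0 = rng.standard_normal((b, d))          # noise sample
    t = rng.uniform(size=(b, 1))              # per-example time in [0, 1]
    xt = (1.0 - t) * x0 + t * x1              # point on the probability path
    v_target = x1 - x0                        # ground-truth velocity field
    v_pred = model(xt, t)                     # model's predicted velocity
    return float(np.mean((v_pred - v_target) ** 2))

# Toy stand-in "model" that predicts zero velocity everywhere; the real
# model would be a diffusion transformer conditioned on video and text.
zero_model = lambda xt, t: np.zeros_like(xt)

x1 = rng.standard_normal((4, 8))
loss = flow_matching_loss(zero_model, x1, rng)
print(loss)
```

At inference, the learned velocity field is integrated from noise toward data with an ODE solver, which is what makes flow matching a few-step alternative to classic diffusion sampling.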