🤖 AI Summary
To address poor audio quality, weak semantic alignment, and audio-visual desynchronization in video-to-audio generation, this paper proposes MMAudio, a multimodal joint-training framework. MMAudio is the first to unify video-to-audio and text-to-audio dual-path generation within a single architecture. It introduces a frame-level conditional synchronization module that aligns video features with the audio latent space at fine granularity, and uses flow matching as the end-to-end training objective. The method supports video-only, video-plus-text, and text-only conditioning. On public benchmarks, MMAudio achieves state-of-the-art performance: significantly improved audio fidelity, stronger semantic alignment, and reduced audio-visual synchronization error. At inference, it generates an 8-second audio clip in 1.23 seconds, with a compact model size of only 157 million parameters.
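To make the flow-matching objective mentioned above concrete, here is a minimal sketch of a conditional flow-matching loss with a linear (rectified-flow) probability path. This is an illustration of the general technique, not MMAudio's actual code: the `model` callable, the tensor shapes, and the absence of video/text conditioning inputs are all simplifying assumptions.

```python
import numpy as np

def flow_matching_loss(model, x1, rng):
    """Conditional flow-matching loss on a straight noise-to-data path.

    x1:    batch of clean latents, shape (B, D) (illustrative shape).
    model: callable model(x_t, t) -> predicted velocity, shape (B, D).
           A real model would also take video/text conditions; omitted here.
    """
    batch_size = x1.shape[0]
    x0 = rng.standard_normal(x1.shape)       # Gaussian noise endpoint
    t = rng.uniform(size=(batch_size, 1))    # per-sample time in [0, 1]
    x_t = (1.0 - t) * x0 + t * x1            # point on the linear path
    v_target = x1 - x0                       # constant velocity of that path
    v_pred = model(x_t, t)
    return np.mean((v_pred - v_target) ** 2) # regress predicted velocity
```

At inference, one would integrate the learned velocity field from noise (t=0) to data (t=1) with a few ODE solver steps, which is what keeps generation fast.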
📝 Abstract
We propose to synthesize high-quality and synchronized audio, given video and optional text conditions, using a novel multimodal joint training framework MMAudio. In contrast to single-modality training conditioned on (limited) video data only, MMAudio is jointly trained with larger-scale, readily available text-audio data to learn to generate semantically aligned high-quality audio samples. Additionally, we improve audio-visual synchrony with a conditional synchronization module that aligns video conditions with audio latents at the frame level. Trained with a flow matching objective, MMAudio achieves new video-to-audio state-of-the-art among public models in terms of audio quality, semantic alignment, and audio-visual synchronization, while having a low inference time (1.23s to generate an 8s clip) and just 157M parameters. MMAudio also achieves surprisingly competitive performance in text-to-audio generation, showing that joint training does not hinder single-modality performance. Code and demo are available at: https://hkchengrex.github.io/MMAudio