🤖 AI Summary
To address poor audio quality, weak semantic alignment, and audio-visual desynchronization in video-to-audio generation, this paper proposes MMAudio, a multimodal joint-training framework. MMAudio is the first to unify video-to-audio and text-to-audio dual-path generation within a single architecture. It introduces a frame-level conditional synchronization module that aligns video features with the audio latent space at fine granularity, and uses flow matching as the end-to-end training objective. The method supports video-only, video-plus-text, and text-only conditioning. On public benchmarks, MMAudio achieves state-of-the-art performance: significantly improved audio fidelity, stronger semantic alignment, and reduced audio-visual synchronization error. At inference, it generates an 8-second audio clip in 1.23 seconds, with a compact model size of only 157 million parameters.
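To make the flow-matching objective mentioned above concrete, here is a minimal sketch of a conditional flow-matching loss with a linear (rectified-flow) probability path. This is an illustration of the general technique, not MMAudio's actual code: the `model` callable, the tensor shapes, and the absence of video/text conditioning inputs are all simplifying assumptions.

```python
import numpy as np

def flow_matching_loss(model, x1, rng):
    """Conditional flow-matching loss on a straight noise-to-data path.

    x1:    batch of clean latents, shape (B, D) (illustrative shape).
    model: callable model(x_t, t) -> predicted velocity, shape (B, D).
           A real model would also take video/text conditions; omitted here.
    """
    batch_size = x1.shape[0]
    x0 = rng.standard_normal(x1.shape)       # Gaussian noise endpoint
    t = rng.uniform(size=(batch_size, 1))    # per-sample time in [0, 1]
    x_t = (1.0 - t) * x0 + t * x1            # point on the linear path
    v_target = x1 - x0                       # constant velocity of that path
    v_pred = model(x_t, t)
    return np.mean((v_pred - v_target) ** 2) # regress predicted velocity
```

At inference, one would integrate the learned velocity field from noise (t=0) to data (t=1) with a few ODE solver steps, which is what keeps generation fast.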
📝 Abstract
We propose to synthesize high-quality and synchronized audio, given video and optional text conditions, using a novel multimodal joint training framework MMAudio. In contrast to single-modality training conditioned on (limited) video data only, MMAudio is jointly trained with larger-scale, readily available text-audio data to learn to generate semantically aligned high-quality audio samples. Additionally, we improve audio-visual synchrony with a conditional synchronization module that aligns video conditions with audio latents at the frame level. Trained with a flow matching objective, MMAudio achieves new video-to-audio state-of-the-art among public models in terms of audio quality, semantic alignment, and audio-visual synchronization, while having a low inference time (1.23s to generate an 8s clip) and just 157M parameters. MMAudio also achieves surprisingly competitive performance in text-to-audio generation, showing that joint training does not hinder single-modality performance. Code and demo are available at: https://hkchengrex.github.io/MMAudio