Frieren: Efficient Video-to-Audio Generation Network with Rectified Flow Matching

📅 2024-06-01

📈 Citations: 5

✨ Influential: 1

🤖 AI Summary

This work addresses the task of generating high-fidelity audio from silent video. We propose Frieren, a non-autoregressive video-to-audio (V2A) framework based on rectified flow matching. A feed-forward Transformer serves as the vector field estimator, and a channel-wise cross-modal feature fusion mechanism strengthens visual-audio temporal alignment. Reflow combined with one-step distillation using a guided vector field enables sampling in a few steps, or even a single step, which improves inference efficiency. On the VGGSound dataset, our method reaches 97.22% temporal alignment accuracy and a 6.2% improvement in Inception Score over a strong diffusion-based baseline, state-of-the-art results in both generation quality and temporal alignment. To the best of our knowledge, this is the first work to jointly leverage rectified flow matching and guided vector field distillation for V2A, balancing generation quality, temporal precision, and computational efficiency.
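The channel-wise fusion mentioned above can be illustrated with a small sketch. This is not the authors' implementation: it assumes that visual features arrive at a lower frame rate than the audio latent, are repeated (nearest-neighbour) to the audio latent's length, and are then concatenated along the channel axis. The function name `fuse_channelwise` and all shapes are hypothetical.

```python
import numpy as np

def fuse_channelwise(audio_latent, visual_feats):
    """Concatenate time-aligned visual features onto the audio latent's channels.

    audio_latent: array of shape (T_a, C_a), one row per audio latent frame.
    visual_feats: array of shape (T_v, C_v), one row per video frame.
    Returns an array of shape (T_a, C_a + C_v).
    Assumption (not from the source): alignment is done by nearest-neighbour
    repetition of video frames along the time axis.
    """
    t_a = audio_latent.shape[0]
    t_v = visual_feats.shape[0]
    # Index of the video frame that covers each audio latent frame.
    idx = np.minimum(np.arange(t_a) * t_v // t_a, t_v - 1)
    aligned = visual_feats[idx]
    # Channel-level fusion: every audio frame carries its own visual context.
    return np.concatenate([audio_latent, aligned], axis=1)

audio = np.zeros((8, 4))             # 8 latent frames, 4 channels
video = np.arange(6).reshape(2, 3)   # 2 video frames, 3 channels
fused = fuse_channelwise(audio, video)
```

Because each audio frame is paired with the video frame at the same time position, the estimator sees frame-level visual context directly, which is one plausible reason this kind of fusion helps temporal alignment.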

📝 Abstract

Video-to-audio (V2A) generation aims to synthesize content-matching audio from silent video, and it remains challenging to build V2A models with high generation quality, efficiency, and visual-audio temporal synchrony. We propose Frieren, a V2A model based on rectified flow matching. Frieren regresses the conditional transport vector field from noise to spectrogram latent along straight paths and samples by solving an ODE, outperforming autoregressive and score-based models in audio quality. By employing a non-autoregressive vector field estimator based on a feed-forward transformer and channel-level cross-modal feature fusion with strong temporal alignment, our model generates audio that is highly synchronized with the input video. Furthermore, through reflow and one-step distillation with a guided vector field, our model can generate decent audio in a few sampling steps, or even a single one. Experiments indicate that Frieren achieves state-of-the-art performance in both generation quality and temporal alignment on VGGSound, with alignment accuracy reaching 97.22% and a 6.2% improvement in inception score over the strong diffusion-based baseline. Audio samples are available at http://frieren-v2a.github.io.
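The rectified flow matching idea in the abstract can be sketched in a few lines. This is a generic illustration under stated assumptions, not Frieren's code: `x0` denotes noise, `x1` the spectrogram latent, the interpolation is the straight line between them, and the ODE is solved with a plain Euler scheme. The names `rf_training_pair`, `rf_loss`, and `euler_sample` are hypothetical, and the "estimator" below is an ideal stand-in rather than a trained transformer.

```python
import numpy as np

def rf_training_pair(x0, x1, t):
    """Point on the straight path from noise x0 to data x1, plus its regression target."""
    x_t = (1.0 - t) * x0 + t * x1   # linear interpolation at time t in [0, 1]
    target = x1 - x0                # constant velocity along the straight path
    return x_t, target

def rf_loss(pred, target):
    """Mean squared error between predicted and target vector field."""
    return float(np.mean((pred - target) ** 2))

def euler_sample(vector_field, x0, cond, num_steps):
    """Integrate dx/dt = v(x, t, cond) from t = 0 to t = 1 with Euler steps."""
    x = x0.copy()
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = i * dt
        x = x + dt * vector_field(x, t, cond)
    return x

rng = np.random.default_rng(0)
noise = rng.standard_normal((8, 4))    # x0: Gaussian noise
latent = rng.standard_normal((8, 4))   # x1: stand-in for a spectrogram latent

# Ideal stand-in estimator: returns the exact straight-path velocity.
ideal_field = lambda x, t, cond: latent - noise

one_step = euler_sample(ideal_field, noise, None, num_steps=1)
four_step = euler_sample(ideal_field, noise, None, num_steps=4)
```

With a perfectly straight path, a single Euler step already lands on the target, which is the intuition behind using reflow and distillation to cut the number of sampling steps. A learned estimator only approximates this, so in practice more steps may still help.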

Problem

Research questions and friction points this paper is trying to address.

Audio Generation from Silent Video

Visual-Audio Temporal Synchronization

High-Quality, Efficient Audio Synthesis

Innovation

Methods, ideas, or system contributions that make the work stand out.

Rectified flow matching for V2A

Few-step and one-step sampling via reflow and distillation

Channel-level cross-modal fusion for synchronization accuracy
