Model-Guided Dual-Role Alignment for High-Fidelity Open-Domain Video-to-Audio Generation

📅 2025-10-28
🤖 AI Summary
To jointly achieve high fidelity and cross-modal consistency in open-domain video-to-audio generation, this paper proposes a model-guided dual-role alignment mechanism: the video encoder serves simultaneously as a conditional controller and a feature aligner, eliminating reliance on external classifiers and improving generation coherence and generalization. The method builds on a flow-based Transformer architecture, integrating dual-role alignment, a model-guided training objective, and video-conditional training. On VGGSound it achieves a state-of-the-art FAD of 0.40, along with superior Fréchet Distance (FD), Inception Score (IS), and cross-modal alignment scores, and it further demonstrates strong generalization on UnAV-100. The core innovation is the first end-to-end alignment-guidance framework that requires no external classifier, unifying the optimization of audio realism and audio-visual consistency.

📝 Abstract
We present MGAudio, a novel flow-based framework for open-domain video-to-audio generation, which introduces model-guided dual-role alignment as a central design principle. Unlike prior approaches that rely on classifier-based or classifier-free guidance, MGAudio enables the generative model to guide itself through a dedicated training objective designed for video-conditioned audio generation. The framework integrates three main components: (1) a scalable flow-based Transformer model, (2) a dual-role alignment mechanism where the audio-visual encoder serves both as a conditioning module and as a feature aligner to improve generation quality, and (3) a model-guided objective that enhances cross-modal coherence and audio realism. MGAudio achieves state-of-the-art performance on VGGSound, reducing FAD to 0.40, substantially surpassing the best classifier-free guidance baselines, and consistently outperforms existing methods across FD, IS, and alignment metrics. It also generalizes well to the challenging UnAV-100 benchmark. These results highlight model-guided dual-role alignment as a powerful and scalable paradigm for conditional video-to-audio generation. Code is available at: https://github.com/pantheon5100/mgaudio
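The abstract describes a flow-based generator whose video encoder plays two roles: conditioning the Transformer and acting as an alignment target for the model's internal features. The paper's actual architecture and loss weights are not given here, so the following is only an illustrative NumPy sketch under assumed shapes: the linear stand-ins for the encoder and velocity model, the rectified-flow interpolant, the cosine alignment term, and the weight `lam` are all assumptions, not MGAudio's implementation.

```python
import numpy as np

def video_encoder(frames, W):
    # Dual role: the same features condition the generator AND
    # serve as the target for the feature-alignment loss below.
    return frames @ W

def velocity_model(x_t, t, cond, Wx, Wc):
    # Toy linear stand-in for the flow-based Transformer.
    # Returns (predicted velocity, hidden feature used for alignment).
    hidden = x_t @ Wx + cond @ Wc + t
    return hidden, hidden

def cosine_align_loss(h, z):
    # 1 - cosine similarity between model features and video features.
    h_n = h / (np.linalg.norm(h, axis=-1, keepdims=True) + 1e-8)
    z_n = z / (np.linalg.norm(z, axis=-1, keepdims=True) + 1e-8)
    return float(np.mean(1.0 - np.sum(h_n * z_n, axis=-1)))

def training_loss(audio_latent, noise, frames, t, params, lam=0.5):
    # Flow-matching objective plus a model-guided alignment term
    # (illustrative combination; the true objective is in the paper).
    W, Wx, Wc = params
    cond = video_encoder(frames, W)           # role 1: conditioning signal
    x_t = (1 - t) * noise + t * audio_latent  # linear (rectified-flow) interpolant
    v_target = audio_latent - noise           # flow-matching velocity target
    v_pred, hidden = velocity_model(x_t, t, cond, Wx, Wc)
    fm_loss = float(np.mean((v_pred - v_target) ** 2))
    align_loss = cosine_align_loss(hidden, cond)  # role 2: alignment target
    return fm_loss + lam * align_loss
```

The key design point this sketch mirrors is that no external classifier appears anywhere: the guidance signal comes from the model's own video-conditioned features.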
Problem

Research questions and friction points this paper is trying to address.

Generating high-fidelity audio from open-domain videos
Improving cross-modal coherence and audio realism
Surpassing classifier-free guidance in video-to-audio generation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Flow-based Transformer model for scalable video-to-audio generation
Dual-role alignment mechanism improving audio-visual feature coherence
Model-guided training objective enhancing cross-modal realism