Training-Free Multimodal Guidance for Video to Audio Generation

📅 2025-09-29
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
Existing video-to-audio (V2A) generation methods either rely on large-scale paired datasets for joint training or model only pairwise cross-modal similarity, failing to ensure global multimodal consistency. To address this, we propose a training-free Multimodal Diffusion Guidance (MDG) mechanism that, for the first time, leverages the geometric volume relationship formed by video, audio, and text embeddings in a shared high-dimensional space to achieve unified cross-modal alignment. MDG is plug-and-play: compatible with any pre-trained audio diffusion model without fine-tuning. Extensive experiments on VGGSound and AudioCaps demonstrate that our method significantly improves the realism and semantic fidelity of generated audio. Quantitatively and perceptually, it outperforms both existing training-free and supervised approaches in audio quality and multimodal consistency, establishing a new state-of-the-art for zero-shot V2A generation.


๐Ÿ“ Abstract
Video-to-audio (V2A) generation aims to synthesize realistic and semantically aligned audio from silent videos, with potential applications in video editing, Foley sound design, and assistive multimedia. Despite these excellent results, existing approaches either require costly joint training on large-scale paired datasets or rely on pairwise similarities that may fail to capture global multimodal coherence. In this work, we propose a novel training-free multimodal guidance mechanism for V2A diffusion that leverages the volume spanned by the modality embeddings to enforce unified alignment across video, audio, and text. The proposed multimodal diffusion guidance (MDG) provides a lightweight, plug-and-play control signal that can be applied on top of any pretrained audio diffusion model without retraining. Experiments on VGGSound and AudioCaps demonstrate that our MDG consistently improves perceptual quality and multimodal alignment compared to baselines, demonstrating the effectiveness of joint multimodal guidance for V2A.
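The abstract's central quantity is the volume spanned by the three modality embeddings. A minimal sketch of one plausible reading, assuming the embeddings live in a shared space and that the volume is the Gram determinant of the (normalized) embedding vectors; the function name and the small-volume-means-aligned interpretation are illustrative assumptions, not the paper's exact formulation:

```python
import numpy as np

def multimodal_volume(video_emb, audio_emb, text_emb):
    """Hypothetical sketch: volume spanned by three modality embeddings.

    Stacks the L2-normalized embeddings and returns the square root of the
    Gram determinant, i.e. the volume of the parallelepiped they span.
    Under this geometric reading, a smaller volume means the embeddings
    are closer to collinear, suggesting tighter cross-modal alignment.
    """
    E = np.stack([video_emb, audio_emb, text_emb])        # shape (3, d)
    E = E / np.linalg.norm(E, axis=1, keepdims=True)      # unit vectors
    gram = E @ E.T                                        # (3, 3) Gram matrix
    return float(np.sqrt(max(np.linalg.det(gram), 0.0)))  # volume >= 0

# Mutually orthogonal unit embeddings span the unit cube (volume 1),
# while identical embeddings span zero volume.
vol = multimodal_volume(np.array([1.0, 0.0, 0.0]),
                        np.array([0.0, 1.0, 0.0]),
                        np.array([0.0, 0.0, 1.0]))
```

A score like this can serve as a differentiable alignment signal: minimizing it during sampling would pull the audio embedding toward the video and text embeddings jointly, rather than via pairwise similarities.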
Problem

Research questions and friction points this paper is trying to address.

How to generate semantically aligned audio from silent video without costly joint training on large paired datasets
How to enforce unified alignment across video, audio, and text, beyond pairwise cross-modal similarity
How to improve perceptual quality on top of pretrained audio diffusion models without retraining them
Innovation

Methods, ideas, or system contributions that make the work stand out.

Training-free multimodal guidance for video-to-audio generation
Uses modality embedding volume for unified alignment
Plug-and-play control for pretrained audio diffusion models
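The plug-and-play claim above amounts to adding a guidance term to an existing sampler. A minimal sketch of one guided denoising step, assuming a classifier-guidance-style update where the gradient of the spanned volume with respect to the noisy audio latent steers the pretrained model's noise prediction; `denoise_fn`, `encode_audio`, and the finite-difference gradient are illustrative stand-ins, not the paper's actual components:

```python
import numpy as np

def mdg_guided_step(x_t, denoise_fn, encode_audio, video_emb, text_emb,
                    scale=1.0, h=1e-4):
    """Hypothetical sketch of one volume-guided denoising step.

    x_t          -- current noisy audio latent (1-D numpy array)
    denoise_fn   -- pretrained diffusion model's noise prediction (stand-in)
    encode_audio -- maps a latent to the shared embedding space (stand-in)
    """
    def volume(x):
        # Volume spanned by the three normalized modality embeddings.
        E = np.stack([video_emb, text_emb, encode_audio(x)])
        E = E / np.linalg.norm(E, axis=1, keepdims=True)
        return np.sqrt(max(np.linalg.det(E @ E.T), 0.0))

    # Finite-difference gradient of the volume w.r.t. the noisy latent.
    grad = np.zeros_like(x_t)
    base = volume(x_t)
    for i in range(x_t.size):
        x_pert = x_t.copy()
        x_pert.flat[i] += h
        grad.flat[i] = (volume(x_pert) - base) / h

    # Steer the pretrained noise prediction so subsequent latents shrink
    # the spanned volume (i.e. tighten cross-modal alignment). The base
    # model is untouched, which is what makes the control plug-and-play.
    return denoise_fn(x_t) + scale * grad
```

Because the guidance only modifies the sampling trajectory, it can in principle be layered on any pretrained audio diffusion model, consistent with the training-free framing above.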