🤖 AI Summary
This work addresses video/text-queried sound separation by reusing a pretrained video-to-audio generative model. The proposed method, MMAudioSep, fine-tunes such a model, conditioned on a video or text query, to generate the queried source from a mixture, so no training from scratch is required and training is more efficient. Experiments comparing it against existing separation models, covering both deterministic and generative approaches, show that it outperforms these baselines. Crucially, even after fine-tuning for separation, the model retains its original video-to-audio generation capability, so a single model supports both the upstream generation task and the downstream separation task. This points to the broader potential of foundational sound generation models being adapted to sound-related downstream tasks.
📝 Abstract
We introduce MMAudioSep, a generative model for video/text-queried sound separation built on a pretrained video-to-audio model. By leveraging the relationship between video/text and audio already learned by the pretrained generative model, MMAudioSep can be trained more efficiently; in particular, it does not need to be trained from scratch. We evaluate MMAudioSep against existing separation models, including both deterministic and generative approaches, and find that it outperforms these baselines. Furthermore, we demonstrate that even after acquiring sound-separation functionality via fine-tuning, the model retains its original video-to-audio generation ability. This highlights the potential of foundational sound generation models to be adapted for downstream sound-related tasks. Our code is available at https://github.com/sony/mmaudiosep.
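Neither the summary nor the abstract spells out the training objective, but the core recipe — fine-tune a pretrained video-to-audio generator so that, given a query (video or text) plus the mixture as conditioning, it generates the queried source — can be sketched as follows. This is a minimal, hypothetical sketch assuming a flow-matching latent generator, a common objective for recent video-to-audio models; the `QueriedSeparator` wrapper, the `mix_proj` conditioning path, and all shapes are illustrative assumptions, not the actual MMAudioSep implementation (see the linked repository for that).

```python
import torch
import torch.nn.functional as F

class QueriedSeparator(torch.nn.Module):
    """Hypothetical wrapper: a pretrained video-to-audio backbone plus a new
    input path that injects the mixture latent as extra conditioning."""

    def __init__(self, pretrained_v2a, latent_dim: int = 64):
        super().__init__()
        self.backbone = pretrained_v2a                           # pretrained V2A generator (fine-tuned here)
        self.mix_proj = torch.nn.Linear(latent_dim, latent_dim)  # newly added mixture-conditioning layer

    def forward(self, noisy_latent, t, mixture_latent, query_emb):
        # Add the projected mixture latent to the noisy input; the query
        # (video or text embedding) goes through the backbone's own
        # conditioning interface, whatever form that takes.
        return self.backbone(noisy_latent + self.mix_proj(mixture_latent), t, query_emb)

def finetune_step(model, optimizer, target_latent, mixture_latent, query_emb):
    """One conditional flow-matching update: teach the model to transport
    noise to the latent of the queried source, given mixture + query."""
    b = target_latent.shape[0]
    t = torch.rand(b, device=target_latent.device)      # random time in [0, 1]
    t_ = t.view(b, *([1] * (target_latent.dim() - 1)))  # broadcast over latent dims
    noise = torch.randn_like(target_latent)
    x_t = (1.0 - t_) * noise + t_ * target_latent       # linear noise-to-data path
    v_target = target_latent - noise                    # its constant velocity field
    v_pred = model(x_t, t, mixture_latent, query_emb)
    loss = F.mse_loss(v_pred, v_target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Usage sketch with a dummy backbone standing in for the pretrained model:
dummy = lambda x, t, q: x  # placeholder; a real V2A transformer goes here
model = QueriedSeparator(dummy, latent_dim=64)
opt = torch.optim.AdamW(model.parameters(), lr=1e-4)
tgt, mix = torch.randn(2, 100, 64), torch.randn(2, 100, 64)
q = torch.randn(2, 512)
print(finetune_step(model, opt, tgt, mix, q))
```

At inference time one would start from noise and integrate the predicted velocity, conditioned on the mixture latent and the query embedding, then decode the resulting latent to a waveform. Because only a light adaptation is applied on top of the pretrained weights, this setup is consistent with the paper's finding that the original video-to-audio generation ability is preserved.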