DGMO: Training-Free Audio Source Separation through Diffusion-Guided Mask Optimization

📅 2025-06-03
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
This work addresses zero-shot language-guided audio source separation (LASS), requiring no task-specific training or fine-tuning. The method leverages the generative prior of a pre-trained audio diffusion model and introduces a diffusion-guided mask optimization framework operating at test time: it iteratively refines a time-frequency mask aligned to the input mixture by incorporating cross-modal language–audio alignment constraints, thereby mitigating the modality mismatch between generative modeling and discriminative separation. To our knowledge, this is the first zero-shot adaptation of pre-trained audio diffusion models to LASS, establishing a novel paradigm wherein generative models are repurposed for discriminative separation tasks. Experiments demonstrate competitive performance against supervised methods across multiple LASS benchmarks, support for open-vocabulary queries, and significantly reduced deployment overhead.
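The test-time loop described above (iteratively refining a time-frequency mask until the masked mixture agrees with a reference) can be illustrated with a minimal NumPy sketch. Everything here is a simplification: in the actual method the reference spectrogram comes from a pretrained text-conditioned audio diffusion model, whereas this toy uses a fixed array, and `optimize_mask` with a plain L2 objective is a hypothetical stand-in, not the paper's implementation.

```python
import numpy as np

def optimize_mask(mixture_spec, reference_spec, steps=500, lr=5.0):
    """Gradient-descent refinement of a time-frequency mask so that the
    masked mixture spectrogram matches a reference spectrogram.
    Inputs are magnitude spectrograms of shape (freq, time)."""
    logits = np.zeros_like(mixture_spec)  # sigmoid(0) = 0.5: neutral start
    for _ in range(steps):
        mask = 1.0 / (1.0 + np.exp(-logits))        # keeps mask in [0, 1]
        residual = mask * mixture_spec - reference_spec
        # Gradient of 0.5 * ||mask * X - R||^2 w.r.t. the mask logits.
        grad = residual * mixture_spec * mask * (1.0 - mask)
        logits -= lr * grad
    return 1.0 / (1.0 + np.exp(-logits))

# Toy demo: the "reference" keeps only the low-frequency half of the mixture,
# standing in for a spectrogram generated by a text-conditioned diffusion model.
rng = np.random.default_rng(0)
X = rng.uniform(0.5, 1.0, size=(8, 10))   # mixture magnitude spectrogram
R = X.copy()
R[4:, :] = 0.0                            # target source lives in rows 0-3
M = optimize_mask(X, R)                   # mask -> 1 on rows 0-3, -> 0 elsewhere
```

Parameterizing the mask through a sigmoid keeps it in [0, 1] without any explicit projection step, so the optimized mask can be applied directly to the mixture spectrogram.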

๐Ÿ“ Abstract
Language-queried Audio Source Separation (LASS) enables open-vocabulary sound separation via natural language queries. While existing methods rely on task-specific training, we explore whether pretrained diffusion models, originally designed for audio generation, can inherently perform separation without further training. In this study, we introduce a training-free framework leveraging generative priors for zero-shot LASS. Analyzing na""ive adaptations, we identify key limitations arising from modality-specific challenges.To address these issues, we propose Diffusion-Guided Mask Optimization (DGMO), a test-time optimization framework that refines spectrogram masks for precise, input-aligned separation. Our approach effectively repurposes pretrained diffusion models for source separation, achieving competitive performance without task-specific supervision. This work expands the application of diffusion models beyond generation, establishing a new paradigm for zero-shot audio separation. The code is available at: https://wltschmrz.github.io/DGMO/
Problem

Research questions and friction points this paper is trying to address.

Exploring pretrained diffusion models for audio separation without training
Addressing modality challenges in zero-shot language-queried source separation
Proposing a training-free framework for precise spectrogram mask optimization
Innovation

Methods, ideas, or system contributions that make the work stand out.

Training-free framework using diffusion models
Diffusion-Guided Mask Optimization (DGMO) technique
Repurposes pretrained models for zero-shot separation