Unsupervised Single-Channel Speech Separation with a Diffusion Prior under Speaker-Embedding Guidance

📅 2025-09-29

📈 Citations: 0

✨ Influential: 0

career value

196K/year

🤖 AI Summary

In unsupervised single-channel speech separation, unconditional diffusion models suffer from temporal speaker identity inconsistency due to the absence of speaker constraints. To address this, we propose a speaker-embedding-guided diffusion framework. Methodologically, we design a speech-specific solver based on Denoising Diffusion Probabilistic Models (DDPM) and incorporate speaker embeddings as conditional priors during the reverse sampling process, thereby ensuring both speaker identity consistency and separability of the estimated utterances. Our key contribution is the first explicit integration of speaker embeddings into the diffusion-based unsupervised speech separation pipeline—enabling dynamic speaker characterization without requiring labeled data. Experiments on standard benchmarks (e.g., WSJ0-2mix) demonstrate significant improvements in SI-SNRi (+1.8 dB) over prior diffusion-based methods. Moreover, the proposed approach exhibits strong robustness under realistic challenging conditions, including reverberation and additive noise.

Technology Category

Application Category

📝 Abstract

Speech separation is a fundamental task in audio processing, typically addressed with fully supervised systems trained on paired mixtures. While effective, such systems typically rely on synthetic data pipelines, which may not reflect real-world conditions. Instead, we revisit the source-model paradigm, training a diffusion generative model solely on anechoic speech and formulating separation as a diffusion inverse problem. However, unconditional diffusion models lack speaker-level conditioning, they can capture local acoustic structure but produce temporally inconsistent speaker identities in separated sources. To address this limitation, we propose Speaker-Embedding guidance that, during the reverse diffusion process, maintains speaker coherence within each separated track while driving embeddings of different speakers further apart. In addition, we propose a new separation-oriented solver tailored for speech separation, and both strategies effectively enhance performance on the challenging task of unsupervised source-model-based speech separation, as confirmed by extensive experimental results. Audio samples and code are available at https://runwushi.github.io/UnSepDiff_demo.

Problem

Research questions and friction points this paper is trying to address.

Developing unsupervised speech separation without paired training data

Maintaining speaker identity consistency in diffusion-based separation

Enhancing separation quality through speaker-embedding guidance strategy

Innovation

Methods, ideas, or system contributions that make the work stand out.

Diffusion generative model for unsupervised speech separation

Speaker-embedding guidance maintains speaker coherence

Separation-oriented solver tailored for speech separation

🔎 Similar Papers

No similar papers found.