🤖 AI Summary
In unsupervised single-channel speech separation, unconditional diffusion models suffer from temporal speaker identity inconsistency due to the absence of speaker constraints. To address this, we propose a speaker-embedding-guided diffusion framework. Methodologically, we design a speech-specific solver based on Denoising Diffusion Probabilistic Models (DDPM) and incorporate speaker embeddings as conditional priors during the reverse sampling process, thereby ensuring both speaker identity consistency and separability of the estimated utterances. Our key contribution is the first explicit integration of speaker embeddings into the diffusion-based unsupervised speech separation pipeline—enabling dynamic speaker characterization without requiring labeled data. Experiments on standard benchmarks (e.g., WSJ0-2mix) demonstrate significant improvements in SI-SNRi (+1.8 dB) over prior diffusion-based methods. Moreover, the proposed approach exhibits strong robustness under realistic challenging conditions, including reverberation and additive noise.
📝 Abstract
Speech separation is a fundamental task in audio processing, typically addressed with fully supervised systems trained on paired mixtures. While effective, such systems typically rely on synthetic data pipelines, which may not reflect real-world conditions. Instead, we revisit the source-model paradigm, training a diffusion generative model solely on anechoic speech and formulating separation as a diffusion inverse problem. However, unconditional diffusion models lack speaker-level conditioning, they can capture local acoustic structure but produce temporally inconsistent speaker identities in separated sources. To address this limitation, we propose Speaker-Embedding guidance that, during the reverse diffusion process, maintains speaker coherence within each separated track while driving embeddings of different speakers further apart. In addition, we propose a new separation-oriented solver tailored for speech separation, and both strategies effectively enhance performance on the challenging task of unsupervised source-model-based speech separation, as confirmed by extensive experimental results. Audio samples and code are available at https://runwushi.github.io/UnSepDiff_demo.