Unsupervised Single-Channel Speech Separation with a Diffusion Prior under Speaker-Embedding Guidance

📅 2025-09-29
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
In unsupervised single-channel speech separation, unconditional diffusion models suffer from temporal speaker identity inconsistency due to the absence of speaker constraints. To address this, we propose a speaker-embedding-guided diffusion framework. Methodologically, we design a speech-specific solver based on Denoising Diffusion Probabilistic Models (DDPM) and incorporate speaker embeddings as conditional priors during the reverse sampling process, thereby ensuring both speaker identity consistency and separability of the estimated utterances. Our key contribution is the first explicit integration of speaker embeddings into the diffusion-based unsupervised speech separation pipeline—enabling dynamic speaker characterization without requiring labeled data. Experiments on standard benchmarks (e.g., WSJ0-2mix) demonstrate significant improvements in SI-SNRi (+1.8 dB) over prior diffusion-based methods. Moreover, the proposed approach exhibits strong robustness under realistic challenging conditions, including reverberation and additive noise.

Technology Category

Application Category

📝 Abstract
Speech separation is a fundamental task in audio processing, typically addressed with fully supervised systems trained on paired mixtures. While effective, such systems typically rely on synthetic data pipelines, which may not reflect real-world conditions. Instead, we revisit the source-model paradigm, training a diffusion generative model solely on anechoic speech and formulating separation as a diffusion inverse problem. However, unconditional diffusion models lack speaker-level conditioning, they can capture local acoustic structure but produce temporally inconsistent speaker identities in separated sources. To address this limitation, we propose Speaker-Embedding guidance that, during the reverse diffusion process, maintains speaker coherence within each separated track while driving embeddings of different speakers further apart. In addition, we propose a new separation-oriented solver tailored for speech separation, and both strategies effectively enhance performance on the challenging task of unsupervised source-model-based speech separation, as confirmed by extensive experimental results. Audio samples and code are available at https://runwushi.github.io/UnSepDiff_demo.
Problem

Research questions and friction points this paper is trying to address.

Developing unsupervised speech separation without paired training data
Maintaining speaker identity consistency in diffusion-based separation
Enhancing separation quality through speaker-embedding guidance strategy
Innovation

Methods, ideas, or system contributions that make the work stand out.

Diffusion generative model for unsupervised speech separation
Speaker-embedding guidance maintains speaker coherence
Separation-oriented solver tailored for speech separation
🔎 Similar Papers
No similar papers found.
Runwu Shi
Runwu Shi
Institute of Science Tokyo
Signal processingIntelligent vehicle
K
Kai Li
Tsinghua University
C
Chang Li
University of Science and Technology of China
J
Jiang Wang
Institute of Science Tokyo
S
Sihan Tan
Institute of Science Tokyo
Kazuhiro Nakadai
Kazuhiro Nakadai
Institute of Science Tokyo
Robot Audition and Scene AnalysisArtificial IntelligenceSignal and Speech ProcessingRobotics