Unmixing the Crowd: Learning Mixture-to-Set Speaker Embeddings for Enrollment-Free Target Speech Extraction

📅 2026-04-03

🤖 AI Summary

This work addresses personalized target speech extraction in noisy, crowded environments, where a clean enrollment utterance is often impractical to obtain. It proposes an end-to-end, enrollment-free method in which the model predicts, directly from the mixture, a small set of speaker embeddings that serve as the control cues for extraction. The embeddings are trained with permutation-invariant teacher supervision so that they align with a robust single-speaker embedding space, which yields a structured and clusterable identity space. On noisy LibriMix, the predicted embeddings outperform WavLM+K-means and separation-derived embeddings on standard clustering metrics. Conditioning multiple extraction back-ends on them consistently improves objective speech quality and intelligibility, and the gains carry over to real recordings from the DNS-Challenge dataset.

📝 Abstract

Personalized or target speech extraction (TSE) typically needs a clean enrollment utterance, which is hard to obtain in real-world crowded environments. We remove the need for enrollment by predicting, from the mixture itself, a small set of per-speaker embeddings that serve as the control signal for extraction. Our model maps a noisy mixture directly to a small set of candidate speaker embeddings, trained via permutation-invariant teacher supervision to align with a strong single-speaker embedding space. On noisy LibriMix, the resulting embeddings form a structured and clusterable identity space, outperforming WavLM+K-means and separation-derived embeddings on standard clustering metrics. Conditioning multiple extraction back-ends on these embeddings consistently improves objective quality and intelligibility, and the improvements generalize to real DNS-Challenge recordings.
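
The permutation-invariant teacher supervision mentioned above can be illustrated with a minimal sketch. The abstract does not specify the loss, so everything here is an assumption made for illustration: cosine distance as the per-pair cost, an equal number of predicted and teacher embeddings, a brute-force search over assignments, and the hypothetical names `cosine_distance` and `pit_embedding_loss`. The idea it shows is only that the predicted set has no fixed order, so the loss is taken under the best matching to the teacher embeddings.

```python
import itertools
import math


def cosine_distance(a, b):
    """1 - cosine similarity between two non-zero vectors (assumed cost)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def pit_embedding_loss(predicted, teacher):
    """Mean cosine distance under the best predicted-to-teacher assignment.

    Returns (loss, perm), where perm[i] is the index of the teacher
    embedding matched to predicted[i]. Brute force over all permutations,
    which is only practical for a small number of speakers.
    """
    if len(predicted) != len(teacher):
        raise ValueError("sketch assumes equal-sized sets")
    n = len(predicted)
    best_loss, best_perm = math.inf, None
    for perm in itertools.permutations(range(n)):
        loss = sum(
            cosine_distance(predicted[i], teacher[perm[i]]) for i in range(n)
        ) / n
        if loss < best_loss:
            best_loss, best_perm = loss, perm
    return best_loss, best_perm


# Toy example: the model emits the two speakers in the opposite order
# from the teacher embeddings, with different scales.
predicted = [[0.0, 1.0], [1.0, 0.0]]
teacher = [[2.0, 0.0], [0.0, 3.0]]
loss, perm = pit_embedding_loss(predicted, teacher)
print(loss, perm)
```

In the toy example the best assignment swaps the two slots, so the ordering mismatch is not penalized. For larger sets, an assignment solver such as SciPy's `linear_sum_assignment` would be a common replacement for the brute-force loop; whether the paper uses one is not stated in the abstract.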

Problem

Research questions and friction points this paper is trying to address.

target speech extraction

enrollment-free

speaker embeddings

mixture separation

crowded environments

Innovation

Methods, ideas, or system contributions that make the work stand out.

enrollment-free

mixture-to-set

speaker embedding