Listen to Extract: Onset-Prompted Target Speaker Extraction

📅 2025-05-08
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the challenging problem of target speaker extraction in single-channel speech separation. We propose a waveform-level “listen-and-extract” paradigm: the enrollment utterance and the mixed speech are directly concatenated in the waveform domain, naturally generating a speaker onset prompt that end-to-end drives a deep neural network to learn time-frequency masks. Crucially, our approach eliminates the need for speaker embedding modules, explicit speech alignment, or auxiliary prompt encoders—achieving strong generalization with a minimalist architecture. Evaluated on standard benchmarks—including WSJ0-2mix, WHAM!, and WHAMR!—our method attains state-of-the-art (SOTA) or near-SOTA performance, significantly outperforming conventional speaker-embedding-guided approaches. These results empirically validate the effectiveness and robustness of waveform-level prompting for target speech separation.

📝 Abstract
We propose $\textit{listen to extract}$ (LExt), a highly-effective while extremely-simple algorithm for monaural target speaker extraction (TSE). Given an enrollment utterance of a target speaker, LExt aims at extracting the target speaker from the speaker's mixed speech with other speakers. For each mixture, LExt concatenates an enrollment utterance of the target speaker to the mixture signal at the waveform level, and trains deep neural networks (DNN) to extract the target speech based on the concatenated mixture signal. The rationale is that, this way, an artificial speech onset is created for the target speaker and it could prompt the DNN (a) which speaker is the target to extract; and (b) spectral-temporal patterns of the target speaker that could help extraction. This simple approach produces strong TSE performance on multiple public TSE datasets including WSJ0-2mix, WHAM! and WHAMR!.
Problem

Research questions and friction points this paper is trying to address.

Extracting target speaker from mixed speech using enrollment utterance
Creating artificial speech onset to prompt DNN for target extraction
Improving target speaker extraction performance on public datasets
Innovation

Methods, ideas, or system contributions that make the work stand out.

Waveform-level concatenation of enrollment and mixture
Deep neural networks for target speaker extraction
Artificial speech onset prompts DNN for extraction
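The input construction above can be sketched in a few lines: the enrollment utterance is simply prepended to the mixture at the waveform level, so the target speaker is always the first voice the network hears. This is a minimal sketch with assumed function and variable names (and an assumed 8 kHz sample rate), not the authors' code:

```python
import numpy as np

def make_onset_prompted_input(enrollment: np.ndarray, mixture: np.ndarray) -> np.ndarray:
    """Prepend the target speaker's enrollment utterance to the mixture
    at the waveform level, creating an artificial speech onset for the
    target speaker that prompts the DNN which voice to extract.
    (Hypothetical helper name; both signals assumed 1-D, same sample rate.)"""
    return np.concatenate([enrollment, mixture])

# Hypothetical usage: 1 s enrollment + 2 s two-speaker mixture at 8 kHz
rng = np.random.default_rng(0)
enroll = rng.standard_normal(8000).astype(np.float32)   # enrollment utterance
mix = rng.standard_normal(16000).astype(np.float32)     # mixed speech
net_input = make_onset_prompted_input(enroll, mix)      # fed to the DNN
```

The DNN then operates on `net_input`, so no speaker-embedding module or auxiliary prompt encoder is needed: the concatenation itself carries the speaker identity cue.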
Pengjie Shen
Department of Computer Science, Inner Mongolia University, Hohhot 010021, China
Kangrui Chen
Department of Computer Science and Engineering, Southern University of Science and Technology, Shenzhen 518055, China
Shulin He
Department of Computer Science, Inner Mongolia University, Hohhot 010021, China
Pengru Chen
Department of Computer Science and Engineering, Southern University of Science and Technology, Shenzhen 518055, China
Shuqi Yuan
Department of Computer Science and Engineering, Southern University of Science and Technology, Shenzhen 518055, China
He Kong
School of Automation and Intelligent Manufacturing, Southern University of Science and Technology, Shenzhen 518055, China
Xueliang Zhang
Inner Mongolia University
Speech enhancement · Speech separation · Computational Auditory Scene Analysis
Zhong-Qiu Wang
Associate Professor, Southern University of Science and Technology
Computer Audition · Speech Separation · Microphone Array · Audio Signal Processing · Deep Learning