GenTSE: Enhancing Target Speaker Extraction via a Coarse-to-Fine Generative Language Model

📅 2025-12-24
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the poor generalization and low speech fidelity of conventional target speaker extraction (TSE) methods, this paper proposes a two-stage decoder-only generative language model. In the first stage, semantic tokens are generated; in the second, acoustic tokens are produced—enabling explicit semantic-acoustic disentanglement. Key innovations include: (i) a semantic-acoustic staged generation paradigm; (ii) Frozen-LM Conditioning to mitigate exposure bias during autoregressive token prediction; and (iii) Direct Preference Optimization (DPO) tailored for speech-perceptual alignment. The model employs an autoregressive architecture built upon continuous self-supervised learning (SSL) or neural codec embeddings, supporting fine-grained token prediction. Evaluated on Libri2Mix, our approach outperforms all existing LM-based TSE methods, achieving significant gains in speech quality (PESQ), intelligibility (STOI), and speaker consistency (SI-SDR).
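The coarse-to-fine pipeline described above can be sketched as two chained autoregressive loops: Stage-1 decodes semantic tokens from the mixture and speaker embeddings, and Stage-2 decodes acoustic tokens conditioned on the full semantic sequence. The toy next-token functions below are illustrative stand-ins, not the paper's transformer LMs; all names and vocabulary sizes are assumptions.

```python
# Minimal sketch of GenTSE-style coarse-to-fine decoding. The real system uses
# decoder-only transformer LMs over continuous SSL/codec embeddings; these toy
# functions only illustrate the two-stage autoregressive control flow.

def stage1_semantic_lm(mixture_emb, spk_emb, prev_tokens):
    # Toy next-token rule standing in for a semantic LM (vocab size 50 assumed).
    return (sum(mixture_emb) + sum(spk_emb) + sum(prev_tokens)) % 50

def stage2_acoustic_lm(semantic_tokens, spk_emb, prev_tokens):
    # Toy next-token rule standing in for an acoustic LM (vocab size 1024 assumed).
    return (sum(semantic_tokens) + sum(spk_emb) + sum(prev_tokens)) % 1024

def generate(mixture_emb, spk_emb, n_sem=4, n_ac=8):
    # Stage 1: autoregressively predict coarse semantic tokens.
    sem = []
    for _ in range(n_sem):
        sem.append(stage1_semantic_lm(mixture_emb, spk_emb, sem))
    # Stage 2: condition on the whole semantic sequence to predict fine acoustic tokens.
    ac = []
    for _ in range(n_ac):
        ac.append(stage2_acoustic_lm(sem, spk_emb, ac))
    return sem, ac

sem, ac = generate([1, 2, 3], [7])
print(len(sem), len(ac))  # 4 8
```

The point of the split is visible in the structure: Stage-2 never sees the raw mixture, only the semantic tokens plus the speaker embedding, which is what enforces the semantic-acoustic disentanglement.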

📝 Abstract
Language Model (LM)-based generative modeling has emerged as a promising direction for TSE, offering potential for improved generalization and high-fidelity speech. We present GenTSE, a two-stage decoder-only generative LM approach for TSE: Stage-1 predicts coarse semantic tokens, and Stage-2 generates fine acoustic tokens. Separating semantics and acoustics stabilizes decoding and yields more faithful, content-aligned target speech. Both stages use continuous SSL or codec embeddings, offering richer context than discretized-prompt methods. To reduce exposure bias, we employ a Frozen-LM Conditioning training strategy that conditions the LMs on tokens predicted by earlier checkpoints, narrowing the gap between teacher-forcing training and autoregressive inference. We further apply DPO to align outputs with human perceptual preferences. Experiments on Libri2Mix show that GenTSE surpasses previous LM-based systems in speech quality, intelligibility, and speaker consistency.
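The DPO objective mentioned in the abstract can be written out concretely: given a perceptually preferred output and a rejected one, it pushes the policy to widen their log-likelihood margin relative to a frozen reference model. This is the standard DPO loss, not a formulation specific to this paper; the function and argument names are illustrative.

```python
import math

def dpo_loss(logp_pref, logp_rej, ref_logp_pref, ref_logp_rej, beta=0.1):
    # Standard DPO loss: -log sigmoid(beta * implicit reward margin), where the
    # margin compares policy vs. frozen-reference log-probs of the preferred
    # and rejected sequences.
    margin = beta * ((logp_pref - ref_logp_pref) - (logp_rej - ref_logp_rej))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# With no margin the loss sits at -log(0.5) ~ 0.693; as the policy favors the
# preferred sample more than the reference does, the loss decreases.
print(round(dpo_loss(-1.0, -1.0, -1.0, -1.0), 3))  # 0.693
```

In the TSE setting, the "preferred" and "rejected" sequences would be token sequences for extracted speech ranked by perceptual quality, which is how the paper ties DPO to speech-perceptual alignment.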
Problem

Research questions and friction points this paper is trying to address.

Enhances target speaker extraction via a two-stage generative language model
Separates semantic and acoustic token generation for stable decoding
Improves speech quality, intelligibility, and speaker consistency in extraction
Innovation

Methods, ideas, or system contributions that make the work stand out.

Two-stage decoder-only generative language model
Separates semantic and acoustic token generation
Uses continuous embeddings and DPO alignment
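Of the innovations listed above, Frozen-LM Conditioning is the least standard, so a sketch helps: during training, some context tokens come from a frozen earlier checkpoint's predictions instead of ground truth, so the model learns on the kind of imperfect prefixes it will see at inference. The mixing rule and all names below are illustrative assumptions, not the paper's exact recipe.

```python
import random

def frozen_lm_predict(context):
    # Toy stand-in for a frozen earlier-checkpoint LM's next-token prediction.
    return (sum(context) + 1) % 50

def make_training_context(gt_tokens, p_pred=0.5, seed=0):
    # Build a training prefix that mixes teacher-forced ground-truth tokens
    # with tokens predicted by the frozen LM, reducing exposure bias.
    rng = random.Random(seed)
    ctx = []
    for gt in gt_tokens:
        if rng.random() < p_pred:
            ctx.append(frozen_lm_predict(ctx))  # model-predicted token
        else:
            ctx.append(gt)                      # ground-truth token
    return ctx

print(make_training_context([3, 4, 5], p_pred=0.0))  # pure teacher forcing: [3, 4, 5]
```

Because the conditioning checkpoint is frozen, its predictions are stable targets; the LM being trained adapts to them without the two models chasing each other.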
Haoyang Li
Nanyang Technological University, Singapore
Xuyi Zhuang
Nanyang Technological University, Singapore
Azmat Adnan
Nanyang Technological University, Singapore
Ye Ni
Southeast University, China
Wei Rao
Tencent, China
Speaker Extraction · Speaker Recognition · Language Recognition · Speech Emotion Recognition · Machine Learning
Shreyas Gopal
Nanyang Technological University, Singapore
Eng Siong Chng
Nanyang Technological University, Singapore