GenTSE: Enhancing Target Speaker Extraction via a Coarse-to-Fine Generative Language Model

📅 2025-12-24

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

To address the poor generalization and low speech fidelity of conventional target speaker extraction (TSE) methods, this paper proposes GenTSE, a two-stage decoder-only generative language model. Stage 1 generates coarse semantic tokens and Stage 2 generates fine acoustic tokens, which separates semantic modeling from acoustic modeling. Key contributions are: (i) a staged semantic-then-acoustic generation paradigm; (ii) Frozen-LM Conditioning, which conditions the LMs on tokens predicted by earlier checkpoints to reduce exposure bias, i.e. the gap between teacher-forced training and autoregressive inference; and (iii) Direct Preference Optimization (DPO) to align outputs with human perceptual preferences. Both stages are conditioned on continuous self-supervised learning (SSL) or neural codec embeddings rather than discretized prompts. On Libri2Mix, GenTSE surpasses previous LM-based TSE systems in speech quality, intelligibility, and speaker consistency.

📝 Abstract

Language model (LM)-based generative modeling has emerged as a promising direction for target speaker extraction (TSE), offering potential for improved generalization and high-fidelity speech. We present GenTSE, a two-stage decoder-only generative LM approach for TSE: Stage-1 predicts coarse semantic tokens, and Stage-2 generates fine acoustic tokens. Separating semantics and acoustics stabilizes decoding and yields more faithful, content-aligned target speech. Both stages use continuous self-supervised learning (SSL) or codec embeddings, offering richer context than discretized-prompt methods. To reduce exposure bias, we employ a Frozen-LM Conditioning training strategy that conditions the LMs on tokens predicted by earlier checkpoints, narrowing the gap between teacher-forcing training and autoregressive inference. We further employ Direct Preference Optimization (DPO) to better align outputs with human perceptual preferences. Experiments on Libri2Mix show that GenTSE surpasses previous LM-based systems in speech quality, intelligibility, and speaker consistency.
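The coarse-to-fine data flow described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the names `decode`, `extract`, `stage1_stub`, `stage2_stub`, the `EOS` id, greedy decoding, and the two-acoustic-tokens-per-semantic-token ratio are all assumptions made for illustration, and the stub "models" are deterministic placeholders rather than real language models. The sketch only shows that Stage 1 is decoded autoregressively from the continuous conditioning, and Stage 2 is decoded from that same conditioning plus the Stage-1 tokens.

```python
from typing import Callable, List, Sequence, Tuple

EOS = 0  # hypothetical end-of-sequence token id


def decode(step: Callable[[List[int]], int], max_len: int) -> List[int]:
    """Greedy autoregressive loop: feed the prefix back in until EOS or max_len."""
    tokens: List[int] = []
    for _ in range(max_len):
        tok = step(tokens)
        if tok == EOS:
            break
        tokens.append(tok)
    return tokens


def extract(stage1, stage2, mixture_emb: Sequence[float], enroll_emb: Sequence[float],
            max_semantic: int = 50, frames_per_semantic: int = 2) -> Tuple[List[int], List[int]]:
    """Two-stage generation: coarse semantic tokens first, then fine acoustic tokens."""
    # Continuous conditioning (assumed here to be enrollment + mixture embeddings).
    cond = list(enroll_emb) + list(mixture_emb)
    # Stage 1: semantic tokens conditioned on the continuous embeddings only.
    semantic = decode(lambda prefix: stage1(cond, prefix), max_semantic)
    # Stage 2: acoustic tokens conditioned on the embeddings AND the Stage-1 output.
    acoustic = decode(lambda prefix: stage2(cond, semantic, prefix),
                      frames_per_semantic * len(semantic))
    return semantic, acoustic


# --- Placeholder "models" (hypothetical, deterministic) ---
def stage1_stub(cond, prefix):
    # Emits tokens 1, 2, 3 and then stops.
    return EOS if len(prefix) >= 3 else len(prefix) + 1


def stage2_stub(cond, semantic, prefix):
    # Emits two acoustic tokens per semantic token, offset by 100.
    i = len(prefix)
    return EOS if i >= 2 * len(semantic) else 100 + semantic[i // 2]


semantic, acoustic = extract(stage1_stub, stage2_stub, [0.1, 0.2], [0.3])
print(semantic)  # [1, 2, 3]
print(acoustic)  # [101, 101, 102, 102, 103, 103]
```

In a real system the stubs would be replaced by decoder-only LMs and the acoustic tokens would be passed to a codec decoder to produce a waveform; neither of those parts is shown here.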

Problem

Research questions and friction points this paper is trying to address.

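Conventional target speaker extraction methods generalize poorly and produce low-fidelity speech

Generating semantic and acoustic content in a single pass makes decoding less stable and less content-aligned

Teacher-forced training leaves a gap to autoregressive inference (exposure bias), and outputs are not explicitly aligned with human perceptual preferences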

Innovation

Methods, ideas, or system contributions that make the work stand out.

Two-stage decoder-only generative language model

Separates semantic and acoustic token generation

Uses continuous SSL or codec embeddings, Frozen-LM Conditioning, and DPO alignment
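For readers unfamiliar with DPO, the following is a minimal sketch of the standard DPO objective for a single preference pair, written with only the Python standard library. It shows the generic formulation, not the speech-specific variant used in the paper, whose details are not given here. The function name `dpo_loss`, the default `beta=0.1`, and the example log-probabilities are assumptions for illustration.

```python
import math


def dpo_loss(policy_logp_chosen: float, policy_logp_rejected: float,
             ref_logp_chosen: float, ref_logp_rejected: float,
             beta: float = 0.1) -> float:
    """Standard DPO loss for one (chosen, rejected) pair of sequence log-probabilities.

    loss = -log(sigmoid(beta * [(pi_c - ref_c) - (pi_r - ref_r)]))
    """
    # How much more the policy favors each output than the frozen reference model does.
    chosen_margin = policy_logp_chosen - ref_logp_chosen
    rejected_margin = policy_logp_rejected - ref_logp_rejected
    x = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(x)) = log(1 + exp(-x)), evaluated in a numerically stable way.
    if x >= 0:
        return math.log1p(math.exp(-x))
    return -x + math.log1p(math.exp(x))


# Hypothetical log-probabilities: the policy prefers the chosen output more than
# the reference does, so the loss falls below log(2), its value at indifference.
print(dpo_loss(-5.0, -9.0, -7.0, -7.0) < math.log(2))  # True
```

The loss equals log(2) when the policy and reference agree on both outputs, and it decreases as the policy raises the likelihood of the preferred output relative to the rejected one.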
