🤖 AI Summary
To address the limited generalization and speech fidelity of conventional target speaker extraction (TSE) methods, this paper proposes GenTSE, a two-stage decoder-only generative language model (LM). Stage 1 predicts coarse semantic tokens and Stage 2 generates fine acoustic tokens, so semantic and acoustic modeling are explicitly separated. Key contributions include: (i) the staged semantic-to-acoustic generation paradigm; (ii) Frozen-LM Conditioning, a training strategy that conditions the LMs on tokens predicted by earlier checkpoints to reduce exposure bias in autoregressive decoding; and (iii) Direct Preference Optimization (DPO) to align outputs with human perceptual preferences. Both stages are autoregressive and use continuous self-supervised learning (SSL) or neural codec embeddings rather than discretized prompts. On Libri2Mix, GenTSE surpasses previous LM-based TSE systems in speech quality, intelligibility, and speaker consistency.
📝 Abstract
Language model (LM)-based generative modeling has emerged as a promising direction for target speaker extraction (TSE), offering potential for improved generalization and high-fidelity speech. We present GenTSE, a two-stage decoder-only generative LM approach for TSE: Stage-1 predicts coarse semantic tokens, and Stage-2 generates fine acoustic tokens. Separating semantics and acoustics stabilizes decoding and yields more faithful, content-aligned target speech. Both stages use continuous self-supervised learning (SSL) or codec embeddings, offering richer context than discretized-prompt methods. To reduce exposure bias, we employ a Frozen-LM Conditioning training strategy that conditions the LMs on tokens predicted by earlier checkpoints, narrowing the gap between teacher-forced training and autoregressive inference. We further employ Direct Preference Optimization (DPO) to better align outputs with human perceptual preferences. Experiments on Libri2Mix show that GenTSE surpasses previous LM-based systems in speech quality, intelligibility, and speaker consistency.
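For readers unfamiliar with DPO, the following is a minimal sketch of the standard DPO objective for a single preference pair. It is not taken from the GenTSE paper: the abstract does not say how preference pairs are built or which hyperparameters are used, so the function name `dpo_loss`, the argument names, and the default `beta=0.1` are illustrative assumptions. Each argument is assumed to be the summed log-probability of a full token sequence (for example, the tokens of one candidate extraction) under either the trainable policy LM or a frozen reference LM.

```python
import math


def dpo_loss(policy_logp_preferred, policy_logp_rejected,
             ref_logp_preferred, ref_logp_rejected, beta=0.1):
    """Standard DPO loss for one (preferred, rejected) pair.

    All names and the default beta are illustrative assumptions,
    not values reported for GenTSE.
    """
    # How much more the policy favors each sequence than the reference does.
    preferred_gain = policy_logp_preferred - ref_logp_preferred
    rejected_gain = policy_logp_rejected - ref_logp_rejected
    margin = beta * (preferred_gain - rejected_gain)

    # loss = -log(sigmoid(margin)) = log(1 + exp(-margin)),
    # computed in a numerically stable way for either sign of margin.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# When the policy matches the reference, the loss equals log(2).
baseline = dpo_loss(-10.0, -10.0, -10.0, -10.0)
# Raising the preferred sequence's likelihood relative to the rejected one lowers the loss.
improved = dpo_loss(-8.0, -12.0, -10.0, -10.0)
```

The loss falls as the policy assigns relatively more probability to the preferred output than the reference model does, which is the mechanism by which DPO steers generation toward preferred outputs without training a separate reward model.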