LauraTSE: Target Speaker Extraction using Auto-Regressive Decoder-Only Language Models

📅 2025-04-10
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the target speaker extraction (TSE) task by proposing LauraTSE, the first single-task TSE method built exclusively on an auto-regressive, decoder-only language model (the LauraGPT backbone). The approach jointly models continuous representations of both the mixed speech and the reference speech through a two-stage architecture: Stage I employs an auto-regressive language model to predict discrete codec tokens of the target speech; Stage II uses a lightweight encoder-only module that fuses multiple conditioning sources (the mixture, the reference speech, and the initial tokens) to reconstruct a high-fidelity waveform. Key contributions include multi-condition cross-attention and stage-wise generation. On mainstream TSE benchmarks, LauraTSE achieves state-of-the-art or competitive performance, notably outperforming conventional discriminative models in SNR improvement and speech intelligibility metrics (STOI and ESTOI).
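The two-stage flow described above can be sketched in a few lines. The sketch below is purely illustrative: every function name, dimension, and projection is a stand-in assumption, not taken from the paper (a real Stage I would be a causal transformer decoder, and Stage II an encoder-only transformer rather than the mean-pooled fusion used here).

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes; the paper does not publish these exact values here.
D = 16      # embedding dimension
VOCAB = 32  # codec codebook size
T = 8       # number of codec frames to generate

def encode(frames):
    """Stand-in continuous encoder: project feature frames to D-dim embeddings."""
    return frames @ rng.standard_normal((frames.shape[1], D))

def ar_decoder_step(context):
    """Stage I (sketch): one auto-regressive step over the running context.
    A real decoder-only LM applies causal self-attention; here we simply
    mean-pool the context and project it to codec-token logits."""
    h = context.mean(axis=0)
    logits = h @ rng.standard_normal((D, VOCAB))
    return int(np.argmax(logits))

def encoder_only_reconstruct(mixture_emb, reference_emb, token_embs):
    """Stage II (sketch): one-step fusion of mixture, reference, and the
    predicted token embeddings into target-speech embeddings."""
    cond = np.concatenate([mixture_emb, reference_emb], axis=0).mean(axis=0)
    return token_embs + cond  # broadcast the fused condition over frames

# Toy inputs standing in for mixed-speech and reference-speech features.
mixture = rng.standard_normal((10, 20))
reference = rng.standard_normal((6, 20))
mix_emb, ref_emb = encode(mixture), encode(reference)

# Stage I: auto-regressively predict discrete codec tokens of the target.
codebook = rng.standard_normal((VOCAB, D))
context = np.concatenate([mix_emb, ref_emb], axis=0)
tokens = []
for _ in range(T):
    tok = ar_decoder_step(context)
    tokens.append(tok)
    context = np.concatenate([context, codebook[tok][None, :]], axis=0)

# Stage II: reconstruct target-speech embeddings from all three conditions.
target_emb = encoder_only_reconstruct(mix_emb, ref_emb, codebook[np.array(tokens)])
print(len(tokens), target_emb.shape)
```

The point of the split is that Stage I handles the hard sequential prediction while Stage II is a cheap, non-autoregressive refinement pass over all conditions at once.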

📝 Abstract
We propose LauraTSE, an Auto-Regressive Decoder-Only Language Model for Target Speaker Extraction (TSE) based on the LauraGPT backbone. It employs a small-scale auto-regressive decoder-only language model which takes the continuous representations for both the mixture and the reference speeches and produces the first few layers of the target speech's discrete codec representations. In addition, a one-step encoder-only language model reconstructs the sum of the predicted codec embeddings using both the mixture and the reference information. Our approach achieves superior or comparable performance to existing generative and discriminative TSE models. To the best of our knowledge, LauraTSE is the first single-task TSE model to leverage an auto-regressive decoder-only language model as the backbone.
Problem

Research questions and friction points this paper is trying to address.

Extracting a target speaker's speech from a mixture, given a reference utterance of that speaker
Whether an auto-regressive decoder-only language model can serve as the backbone of a single-task TSE system
Reconstructing high-fidelity target speech from predicted discrete codec tokens
Innovation

Methods, ideas, or system contributions that make the work stand out.

First single-task TSE model built on an auto-regressive decoder-only language model
Predicts the first few layers of the target speech's discrete codec representations
One-step encoder-only model fuses mixture and reference information to reconstruct the target
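The fusion of mixture and reference information can be pictured as cross-attention, where predicted target-token embeddings attend over the concatenated condition sequences. The sketch below is a bare single-head scaled dot-product attention with no learned projections; all shapes and names are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(1)
D = 8  # illustrative model dimension

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, keys_values):
    """Single-head scaled dot-product cross-attention (no projections)."""
    scores = queries @ keys_values.T / np.sqrt(D)
    return softmax(scores) @ keys_values

# Queries: embeddings of the initially predicted codec tokens.
token_emb = rng.standard_normal((5, D))
# Conditions: mixture and reference embeddings concatenated along time.
mixture_emb = rng.standard_normal((12, D))
reference_emb = rng.standard_normal((7, D))
conditions = np.concatenate([mixture_emb, reference_emb], axis=0)

fused = cross_attention(token_emb, conditions)
print(fused.shape)  # each token frame attends over all condition frames
```

Attending over both sources jointly lets each output frame weigh evidence from the mixture (what was said) against the reference (who said it).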