🤖 AI Summary
This work addresses the target speaker extraction (TSE) task by proposing LauraTSE, the first single-task TSE model built upon an autoregressive, decoder-only language model (the LauraGPT backbone). The approach jointly models continuous representations of both the mixture and the reference speech through a two-stage architecture: Stage I employs a small autoregressive decoder-only language model to predict the first few layers of the target speech's discrete codec representations; Stage II uses a lightweight, one-step encoder-only model that fuses multiple conditioning sources (the mixture, the reference speech, and the predicted tokens) to reconstruct high-fidelity waveforms. Key innovations include multi-condition fusion and stage-wise generation. Evaluated on mainstream TSE benchmarks, LauraTSE achieves superior or comparable performance to existing generative and discriminative models, particularly on SNR improvement and speech intelligibility metrics (STOI and ESTOI).
📝 Abstract
We propose LauraTSE, an auto-regressive decoder-only language model for target speaker extraction (TSE) built on the LauraGPT backbone. It employs a small-scale auto-regressive decoder-only language model that takes continuous representations of both the mixture and the reference speech and produces the first few layers of the target speech's discrete codec representations. In addition, a one-step encoder-only language model reconstructs the sum of the predicted codec embeddings using both the mixture and the reference information. Our approach achieves superior or comparable performance to existing generative and discriminative TSE models. To the best of our knowledge, LauraTSE is the first single-task TSE model to leverage an auto-regressive decoder-only language model as the backbone.
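The two-stage flow described above (Stage I: autoregressive prediction of discrete codec tokens conditioned on the mixture and reference; Stage II: one-step fusion of tokens with the mixture and reference to reconstruct the output) can be sketched as a toy pipeline. This is a minimal illustration of the dataflow only; every function name below (`encode_continuous`, `ar_decoder_step`, etc.) is a hypothetical placeholder, not the paper's actual implementation or API.

```python
def encode_continuous(wave):
    # Placeholder for a continuous acoustic encoder: here we just
    # group the signal into 2-sample "frames".
    return [wave[i:i + 2] for i in range(0, len(wave), 2)]

def ar_decoder_step(prefix, mix_feats, ref_feats):
    # Placeholder for the decoder-only LM: predicts the next discrete
    # codec token conditioned on the mixture, the reference, and the
    # tokens generated so far (here via a trivial deterministic rule).
    return (len(prefix) + len(mix_feats) + len(ref_feats)) % 1024

def stage1_predict_tokens(mixture, reference, n_tokens=4):
    """Stage I: autoregressively predict the target speech's codec tokens."""
    mix_feats = encode_continuous(mixture)
    ref_feats = encode_continuous(reference)
    tokens = []
    for _ in range(n_tokens):
        tokens.append(ar_decoder_step(tokens, mix_feats, ref_feats))
    return tokens

def stage2_reconstruct(tokens, mixture, reference):
    """Stage II: one-step encoder-only model fuses the predicted tokens
    with the mixture and reference to produce the output (toy version)."""
    fused = sum(tokens) + sum(mixture) + sum(reference)
    return [fused / (len(tokens) + 1)] * len(mixture)

mixture = [0.1, 0.2, 0.3, 0.4]   # mixed speech (toy samples)
reference = [0.5, 0.6]           # reference speech from the target speaker
tokens = stage1_predict_tokens(mixture, reference)
wave = stage2_reconstruct(tokens, mixture, reference)
```

The key design point this sketch mirrors is that Stage II is not autoregressive: it runs once, taking the Stage I tokens together with the original mixture and reference as conditions, so reconstruction quality does not depend solely on the predicted tokens.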