🤖 AI Summary
This work addresses a limitation of traditional speech separation methods, which often optimize signal-level metrics at the expense of speech intelligibility, thereby degrading downstream task performance. For the first time, the authors integrate a speech language model into generative speech separation and propose a discrete multi-codebook sequence modeling framework. The approach encodes mixed speech into token sequences via vector quantization and employs an encoder-decoder architecture to autoregressively generate target speech tokens. To improve decoding efficiency without compromising linguistic consistency, a non-autoregressive mechanism is introduced for the residual tokens. Experimental results on LibriMix demonstrate that the proposed method significantly improves the intelligibility of the separated speech and outperforms existing approaches across multiple downstream tasks.
📄 Abstract
Speech separation (SS) has advanced significantly with neural network-based methods, showing improved performance on signal-level metrics. However, these methods often struggle to maintain speech intelligibility in the separated signals, which can negatively affect the performance of downstream tasks such as speech recognition. In this work, we propose SLM-SS, a novel approach that applies speech language models to SS, aiming to enhance the intelligibility and coherence of the separated signals. We frame SS as discrete multi-codebook sequence generation, using encoder-decoder models to map quantized speech mixtures to target tokens. In addition to the autoregressive modeling strategy, we introduce a non-autoregressive model to improve decoding efficiency for the residual tokens. Experimental results on the LibriMix dataset demonstrate that our approach preserves speech intelligibility significantly better than existing methods, leading to improved linguistic consistency across a variety of downstream tasks.
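The decoding scheme described above can be sketched in miniature: the first (coarse) codebook of the target speaker's tokens is generated autoregressively, one step at a time, and the remaining residual codebooks are then predicted in a single non-autoregressive pass. This is a hypothetical toy illustration, not the authors' implementation; `ar_step` and `nar_step` are placeholder functions standing in for the trained encoder-decoder model, and the vocabulary size, codebook count, and sequence length are invented for the example.

```python
# Toy sketch of AR coarse-token decoding followed by NAR residual-token
# prediction, as in the multi-codebook framework described in the abstract.
# All model components here are stand-ins, not the real SLM-SS network.

VOCAB, N_CODEBOOKS, T = 1024, 4, 8  # assumed values for illustration only

def ar_step(mixture_tokens, prefix):
    """Stand-in for one autoregressive decoding step (codebook 0).
    A real model would condition on the encoded mixture and the prefix."""
    return (sum(prefix) + len(mixture_tokens)) % VOCAB

def nar_step(mixture_tokens, coarse):
    """Stand-in for the non-autoregressive residual-token predictor:
    codebooks 1..N-1 are produced for all time steps in one parallel pass."""
    return [[(tok + c) % VOCAB for tok in coarse]
            for c in range(1, N_CODEBOOKS)]

def separate(mixture_tokens, n_steps=T):
    """Decode one target speaker's token streams from quantized mixture tokens."""
    coarse = []
    for _ in range(n_steps):                     # AR: one token per step
        coarse.append(ar_step(mixture_tokens, coarse))
    residual = nar_step(mixture_tokens, coarse)  # NAR: all residuals at once
    return [coarse] + residual                   # shape: (N_CODEBOOKS, T)

streams = separate([17, 256, 3])
print(len(streams), len(streams[0]))  # 4 codebook streams of length 8
```

The AR pass preserves the left-to-right linguistic coherence that signal-level methods lack, while the single NAR pass keeps the cost of decoding the extra codebooks constant rather than multiplying the sequence length.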