SLM-SS: Speech Language Model for Generative Speech Separation

📅 2026-01-27
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
This work addresses the limitation of traditional speech separation methods, which often optimize signal-level metrics at the expense of speech intelligibility, thereby degrading downstream task performance. For the first time, the authors integrate a speech language model into generative speech separation and propose a discrete multi-codebook sequence modeling framework. The approach encodes mixed speech into token sequences via vector quantization and employs an encoder–decoder architecture to autoregressively generate target speech tokens. To enhance decoding efficiency without compromising linguistic consistency, a non-autoregressive residual token mechanism is introduced. Experimental results on LibriMix demonstrate that the proposed method significantly improves the intelligibility of separated speech and outperforms existing approaches across multiple downstream tasks.

📝 Abstract
Speech separation (SS) has advanced significantly with neural network-based methods, showing improved performance on signal-level metrics. However, these methods often struggle to maintain speech intelligibility in the separated signals, which can negatively affect the performance of downstream tasks such as speech recognition. In this work, we propose SLM-SS, a novel approach that applies speech language models to SS, aiming to enhance the intelligibility and coherence of the separated signals. We frame SS as discrete multi-codebook sequence generation, using Encoder-Decoder models to map quantized speech mixtures to target tokens. In addition to the autoregressive modeling strategy, we introduce a non-autoregressive model to improve decoding efficiency for residual tokens. Experimental results on the LibriMix dataset demonstrate that our approach shows significantly better preservation of speech intelligibility, leading to improved linguistic consistency in a variety of downstream tasks compared to existing approaches.
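The decoding scheme outlined above can be sketched as a two-stage token generator: an autoregressive pass produces the first-codebook token stream for the target speaker, and a non-autoregressive pass fills in the residual codebooks in parallel. This is an illustrative sketch only; the function names, token layout, and `EOS` convention are assumptions, not the authors' implementation, and the model calls are stubbed out.

```python
from typing import Callable, List

EOS = -1  # hypothetical end-of-sequence token id (illustrative)

def separate_tokens(
    mixture_tokens: List[List[int]],                       # [codebook][frame] tokens of the mixture
    ar_step: Callable[[List[List[int]], List[int]], int],  # predicts next first-codebook token
    nar_fill: Callable[[List[List[int]], List[int], int], List[int]],  # predicts one residual codebook
    num_codebooks: int,
    max_len: int,
) -> List[List[int]]:
    """Two-stage decoding: AR for the first codebook, NAR for residuals."""
    # 1) Autoregressive pass: generate the first-codebook token stream
    #    for the target speaker, conditioned on the mixture tokens.
    first: List[int] = []
    for _ in range(max_len):
        nxt = ar_step(mixture_tokens, first)
        if nxt == EOS:
            break
        first.append(nxt)

    # 2) Non-autoregressive pass: each residual codebook is predicted
    #    in a single step from the first-codebook sequence, which is
    #    what buys the decoding-efficiency gain mentioned in the abstract.
    out = [first]
    for k in range(1, num_codebooks):
        out.append(nar_fill(mixture_tokens, first, k))
    return out
```

In this framing the AR stage carries the linguistic content (hence its role in preserving intelligibility), while the cheap NAR stage restores the acoustic detail held in the residual codebooks.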
Problem

Research questions and friction points this paper is trying to address.

speech separation
speech intelligibility
downstream tasks
generative speech
Innovation

Methods, ideas, or system contributions that make the work stand out.

Speech Language Model
Generative Speech Separation
Discrete Multi-codebook Sequence Generation
Non-autoregressive Modeling
Speech Intelligibility
Tianhua Li
Auditory Cognition and Computational Acoustics Lab, MoE Key Lab of Artificial Intelligence, AI Institute, School of Computer Science, Shanghai Jiao Tong University, Shanghai, China
Chenda Li
Shanghai Jiao Tong University
Speech Separation
Wei Wang
Shanghai Jiao Tong University
speech recognition, speech enhancement, text-to-speech
Xin Zhou
Auditory Cognition and Computational Acoustics Lab, MoE Key Lab of Artificial Intelligence, AI Institute, School of Computer Science, Shanghai Jiao Tong University, Shanghai, China
Xihui Chen
University of Luxembourg
Privacy, Computational social science, Differential privacy, Heterogeneous graph, Graph learning
Jianqing Gao
AI research Institute, iFLYTEK Company Limited, Hefei, Anhui, China
Yanmin Qian
Professor, Shanghai Jiao Tong University
Speech and Language Processing, Signal Processing, Machine Learning