🤖 AI Summary
This work addresses a limitation of traditional speech separation methods, which often optimize signal-level metrics at the expense of speech intelligibility, thereby degrading downstream task performance. For the first time, the authors integrate a speech language model into generative speech separation and propose a discrete multi-codebook sequence modeling framework. The approach encodes mixed speech into token sequences via vector quantization and employs an encoder-decoder architecture to autoregressively generate target speech tokens. To improve decoding efficiency without compromising linguistic consistency, a non-autoregressive mechanism is introduced for the residual tokens. Experimental results on LibriMix demonstrate that the proposed method significantly improves the intelligibility of the separated speech and outperforms existing approaches across multiple downstream tasks.
📄 Abstract
Speech separation (SS) has advanced significantly with neural network-based methods, showing improved performance on signal-level metrics. However, these methods often struggle to maintain speech intelligibility in the separated signals, which can negatively affect the performance of downstream tasks such as speech recognition. In this work, we propose SLM-SS, a novel approach that applies speech language models to SS, aiming to enhance the intelligibility and coherence of the separated signals. We frame SS as discrete multi-codebook sequence generation, using encoder-decoder models to map quantized speech mixtures to target tokens. In addition to the autoregressive modeling strategy, we introduce a non-autoregressive model to improve decoding efficiency for the residual tokens. Experimental results on the LibriMix dataset demonstrate that our approach preserves speech intelligibility significantly better than existing methods, leading to improved linguistic consistency across a variety of downstream tasks.
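The decoding scheme described above can be sketched in miniature: the first (coarse) codebook of the target speaker's tokens is generated autoregressively, one step at a time, and the remaining residual codebooks are then predicted in a single non-autoregressive pass. This is a hypothetical toy illustration, not the authors' implementation; `ar_step` and `nar_step` are placeholder functions standing in for the trained encoder-decoder model, and the vocabulary size, codebook count, and sequence length are invented for the example.

```python
# Toy sketch of AR coarse-token decoding followed by NAR residual-token
# prediction, as in the multi-codebook framework described in the abstract.
# All model components here are stand-ins, not the real SLM-SS network.

VOCAB, N_CODEBOOKS, T = 1024, 4, 8  # assumed values for illustration only

def ar_step(mixture_tokens, prefix):
    """Stand-in for one autoregressive decoding step (codebook 0).
    A real model would condition on the encoded mixture and the prefix."""
    return (sum(prefix) + len(mixture_tokens)) % VOCAB

def nar_step(mixture_tokens, coarse):
    """Stand-in for the non-autoregressive residual-token predictor:
    codebooks 1..N-1 are produced for all time steps in one parallel pass."""
    return [[(tok + c) % VOCAB for tok in coarse]
            for c in range(1, N_CODEBOOKS)]

def separate(mixture_tokens, n_steps=T):
    """Decode one target speaker's token streams from quantized mixture tokens."""
    coarse = []
    for _ in range(n_steps):                     # AR: one token per step
        coarse.append(ar_step(mixture_tokens, coarse))
    residual = nar_step(mixture_tokens, coarse)  # NAR: all residuals at once
    return [coarse] + residual                   # shape: (N_CODEBOOKS, T)

streams = separate([17, 256, 3])
print(len(streams), len(streams[0]))  # 4 codebook streams of length 8
```

The AR pass preserves the left-to-right linguistic coherence that signal-level methods lack, while the single NAR pass keeps the cost of decoding the extra codebooks constant rather than multiplying the sequence length.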