🤖 AI Summary
Discrete speech coding holds significant promise for language model–based speech generation, yet its adoption is hindered by high bitrates and strong entanglement between linguistic content and speaker identity. To address this, the authors propose LSCodec, a low-bitrate, speaker-decoupled discrete speech codec trained with a multi-stage unsupervised framework. The approach first establishes a continuous information bottleneck under unsupervised speaker perturbation, then applies vector quantization to obtain a compact, speaker-disentangled discrete space, and finally trains a discrete-token vocoder to refine acoustic detail. Despite using only a single codebook with a smaller vocabulary than the baselines, LSCodec surpasses them in intelligibility and audio quality. Voice conversion and speaker probing experiments demonstrate strong speaker disentanglement, and ablation studies validate the effectiveness of each component.
📝 Abstract
Although discrete speech tokens have exhibited strong potential for language model-based speech generation, their high bitrates and redundant timbre information restrict the development of such models. In this work, we propose LSCodec, a discrete speech codec with both a low bitrate and speaker decoupling ability. LSCodec adopts a multi-stage unsupervised training framework with a speaker perturbation technique. A continuous information bottleneck is first established, followed by vector quantization that produces a discrete, speaker-decoupled space. A discrete-token vocoder finally refines acoustic details from LSCodec tokens. In reconstruction evaluations, LSCodec demonstrates superior intelligibility and audio quality with only a single codebook and a smaller vocabulary size than baselines. Voice conversion and speaker probing experiments prove the excellent speaker disentanglement of LSCodec, and an ablation study verifies the effectiveness of the proposed training framework.
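At the core of such a codec, single-codebook vector quantization maps each continuous bottleneck frame to the index of its nearest codebook entry, so the bitrate is just the frame rate times log2 of the vocabulary size. The sketch below illustrates this lookup; the feature dimension, vocabulary size, and frame count are illustrative assumptions, not the paper's actual configuration:

```python
import numpy as np

def vq_encode(frames, codebook):
    """Map continuous frames (T, D) to nearest-neighbor indices in a (K, D) codebook."""
    # Squared Euclidean distance from every frame to every codebook entry: (T, K)
    dists = ((frames[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    return dists.argmin(axis=1)  # one discrete token per frame

def vq_decode(indices, codebook):
    """Recover the quantized continuous frames by codebook lookup."""
    return codebook[indices]

rng = np.random.default_rng(0)
codebook = rng.normal(size=(300, 64))   # hypothetical 300-entry codebook, 64-dim features
frames = rng.normal(size=(100, 64))     # 100 continuous bottleneck frames

idx = vq_encode(frames, codebook)       # shape (100,), values in [0, 300)
recon = vq_decode(idx, codebook)        # quantized frames, same shape as input

# With e.g. 50 frames/s, the token stream costs 50 * log2(300) ≈ 412 bits/s.
bits_per_second = 50 * np.log2(codebook.shape[0])
```

In a full codec these quantized frames would then be fed to a vocoder that reconstructs the waveform, with speaker information reintroduced separately rather than carried by the tokens.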