🤖 AI Summary
Discrete speech coding holds significant promise for language model–based speech generation, yet its adoption is hindered by high bitrates and strong entanglement between linguistic content and speaker identity. To address this, the authors propose LSCodec, a low-bitrate, speaker-decoupled discrete speech codec trained with a multi-stage unsupervised framework. The approach first establishes a continuous information bottleneck under unsupervised speaker perturbation, then applies vector quantization to obtain a compact, speaker-disentangled discrete space, and finally trains a discrete-token vocoder to refine acoustic detail. Despite using only a single codebook with a smaller vocabulary than the baselines, LSCodec surpasses them in intelligibility and audio quality. Voice conversion and speaker probing experiments demonstrate strong speaker disentanglement, and ablation studies validate the effectiveness of each component.
📝 Abstract
Although discrete speech tokens have exhibited strong potential for language model-based speech generation, their high bitrates and redundant timbre information restrict the development of such models. In this work, we propose LSCodec, a discrete speech codec with both a low bitrate and speaker decoupling ability. LSCodec adopts a multi-stage unsupervised training framework with a speaker perturbation technique. A continuous information bottleneck is first established, followed by vector quantization that produces a discrete, speaker-decoupled space. A discrete-token vocoder finally refines acoustic details from LSCodec tokens. In reconstruction evaluations, LSCodec demonstrates superior intelligibility and audio quality with only a single codebook and a smaller vocabulary size than baselines. Voice conversion and speaker probing experiments prove the excellent speaker disentanglement of LSCodec, and an ablation study verifies the effectiveness of the proposed training framework.
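At the core of such a codec, single-codebook vector quantization maps each continuous bottleneck frame to the index of its nearest codebook entry, so the bitrate is just the frame rate times log2 of the vocabulary size. The sketch below illustrates this lookup; the feature dimension, vocabulary size, and frame count are illustrative assumptions, not the paper's actual configuration:

```python
import numpy as np

def vq_encode(frames, codebook):
    """Map continuous frames (T, D) to nearest-neighbor indices in a (K, D) codebook."""
    # Squared Euclidean distance from every frame to every codebook entry: (T, K)
    dists = ((frames[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    return dists.argmin(axis=1)  # one discrete token per frame

def vq_decode(indices, codebook):
    """Recover the quantized continuous frames by codebook lookup."""
    return codebook[indices]

rng = np.random.default_rng(0)
codebook = rng.normal(size=(300, 64))   # hypothetical 300-entry codebook, 64-dim features
frames = rng.normal(size=(100, 64))     # 100 continuous bottleneck frames

idx = vq_encode(frames, codebook)       # shape (100,), values in [0, 300)
recon = vq_decode(idx, codebook)        # quantized frames, same shape as input

# With e.g. 50 frames/s, the token stream costs 50 * log2(300) ≈ 412 bits/s.
bits_per_second = 50 * np.log2(codebook.shape[0])
```

In a full codec these quantized frames would then be fed to a vocoder that reconstructs the waveform, with speaker information reintroduced separately rather than carried by the tokens.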