LSCodec: Low-Bitrate and Speaker-Decoupled Discrete Speech Codec

📅 2024-10-21
🏛️ arXiv.org
📈 Citations: 1
Influential: 0
🤖 AI Summary
Discrete speech coding holds significant promise for language model–based speech generation, yet its adoption is hindered by high bitrates and strong entanglement between linguistic content and speaker identity. To address this, we propose LSCodec, a low-bitrate, speaker-decoupled discrete speech codec trained with a multi-stage unsupervised framework. A continuous information bottleneck is first established with the help of an unsupervised speaker perturbation technique; vector quantization then produces a compact, speaker-disentangled discrete space; and a discrete-token vocoder finally refines acoustic details. Despite using only a single codebook with a smaller vocabulary than baselines, LSCodec surpasses them in intelligibility and audio quality. Voice conversion and speaker probing experiments demonstrate robust speaker disentanglement, and ablation studies validate each component of the training framework.

📝 Abstract
Although discrete speech tokens have exhibited strong potential for language model-based speech generation, their high bitrates and redundant timbre information restrict the development of such models. In this work, we propose LSCodec, a discrete speech codec that has both low bitrate and speaker decoupling ability. LSCodec adopts a multi-stage unsupervised training framework with a speaker perturbation technique. A continuous information bottleneck is first established, followed by vector quantization that produces a discrete speaker-decoupled space. A discrete token vocoder finally refines acoustic details from LSCodec. By reconstruction evaluations, LSCodec demonstrates superior intelligibility and audio quality with only a single codebook and smaller vocabulary size than baselines. Voice conversion and speaker probing experiments prove the excellent speaker disentanglement of LSCodec, and ablation study verifies the effectiveness of the proposed training framework.
Problem

Research questions and friction points this paper is trying to address.

High bitrates of discrete speech tokens restrict language model–based speech generation
Redundant timbre (speaker) information is entangled with linguistic content
Maintaining intelligibility and audio quality under a tight bitrate budget
Innovation

Methods, ideas, or system contributions that make the work stand out.

Low-bitrate, single-codebook discrete speech codec
Speaker-decoupled discrete space via vector quantization
Multi-stage unsupervised training framework with speaker perturbation
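The pipeline summarized above centers on vector quantization of a continuous bottleneck into a single small codebook. As a rough illustration of that step only (all names, shapes, and sizes here are invented for the sketch and are not taken from the paper):

```python
import numpy as np

def quantize(frames, codebook):
    """Map each continuous bottleneck frame to its nearest codebook entry.

    frames:   (T, D) continuous bottleneck features
    codebook: (K, D) learned single codebook
    Returns (indices, quantized): discrete tokens and their embeddings.
    """
    # Squared Euclidean distance from every frame to every codebook entry
    dists = ((frames[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)  # (T, K)
    idx = dists.argmin(axis=1)       # one discrete token per frame
    return idx, codebook[idx]        # (T,), (T, D)

rng = np.random.default_rng(0)
codebook = rng.normal(size=(300, 64))   # e.g. a 300-entry vocabulary
frames = rng.normal(size=(50, 64))      # 50 frames of bottleneck features
tokens, quantized = quantize(frames, codebook)
```

With a hypothetical 300-entry vocabulary at 50 tokens per second, the bitrate is 50 · log2(300) ≈ 0.41 kbps, which illustrates why a single small codebook keeps the bitrate low.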
👥 Authors
Yiwei Guo
MoE Key Lab of Artificial Intelligence, AI Institute, X-LANCE Lab, Shanghai Jiao Tong University, Shanghai, China
Zhihan Li
Kuaishou Technology, Tsinghua University
Chenpeng Du
ByteDance
Hankun Wang
Shanghai Jiao Tong University
Xie Chen
MoE Key Lab of Artificial Intelligence, AI Institute, X-LANCE Lab, Shanghai Jiao Tong University, Shanghai, China
Kai Yu
MoE Key Lab of Artificial Intelligence, AI Institute, X-LANCE Lab, Shanghai Jiao Tong University, Shanghai, China