CodecSlime: Temporal Redundancy Compression of Neural Speech Codec via Dynamic Frame Rate

📅 2025-06-26
📈 Citations: 0
Influential: 0
🤖 AI Summary
Uneven temporal information density in speech causes fixed-frame-rate (FFR) neural codecs to generate excessive redundant tokens during stationary segments (e.g., prolonged vowels, silence), limiting both compression efficiency and reconstruction fidelity. To address this, we propose CodecSlime—the first unsupervised, plug-and-play, architecture-agnostic dynamic frame-rate (DFR) mechanism for neural speech codecs. Its core innovations are: (1) ScheDFR, a content-aware frame-rate scheduler that adapts sampling density to phonetic dynamics; and (2) Melt-and-Cool, a co-optimized DFR training strategy ensuring stable convergence and high-fidelity token generation. CodecSlime integrates seamlessly into mainstream architectures (e.g., VQ-GAN) and enables single-model, multi-bitrate inference. Experiments demonstrate that at ~600 bps, it reduces word error rate (WER) by up to 46% over FFR baselines, while simultaneously improving reconstruction quality and computational efficiency—significantly advancing the rate-distortion frontier in neural speech compression.

📝 Abstract
Neural speech codecs have been widely used in audio compression and various downstream tasks. Current mainstream codecs are fixed-frame-rate (FFR), which allocate the same number of tokens to every equal-duration slice. However, speech is inherently non-uniform in temporal information density. As a result, many tokens are wasted on steady-state segments like long vowels and silences. To address this mismatch, we present CodecSlime, a plugin-style method for compressing temporal redundancy through supporting dynamic frame rate (DFR) on neural speech codecs for the first time. Our method is unsupervised and architecture-agnostic, combining two key innovations, ScheDFR and Melt-and-Cool, for adapting inference and training, respectively. When integrated into a typical VQ-GAN codec backbone and operating at 40 Hz DFR (≈ 600 bps), the reconstruction WER of CodecSlime is reduced by up to 46% relative to conventional FFR baselines with the same model architecture and similar bitrates, while other metrics are also competitive. CodecSlime also enables flexible trade-offs between reconstruction quality and bitrate: a single model supports inference at multiple frame rates and consistently outperforms FFR models at the corresponding frame rates. Audio samples are available at https://acadarmeria.github.io/codecslime/.
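The abstract does not spell out how ScheDFR decides which frames to collapse, but the core idea of dynamic frame rate, spending fewer tokens on stationary segments such as held vowels or silence, can be sketched as a toy post-hoc merge over per-frame features. The cosine-similarity threshold and greedy merging below are illustrative assumptions, not the paper's actual scheduler:

```python
import math

def merge_redundant_frames(frames, sim_threshold=0.98):
    """Toy dynamic-frame-rate sketch: greedily merge runs of adjacent
    frames whose cosine similarity exceeds a threshold, averaging each
    run into a single token. Illustrative only -- NOT the paper's
    ScheDFR algorithm, whose scheduling criterion is not given here."""
    def cos(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(x * x for x in b))
        return dot / (na * nb) if na and nb else 0.0

    merged = []          # output tokens (variable rate)
    group = [frames[0]]  # current run of mutually similar frames
    for f in frames[1:]:
        if cos(group[-1], f) >= sim_threshold:
            group.append(f)  # still stationary: extend the run
        else:
            # close the run: one averaged token replaces the whole run
            merged.append([sum(c) / len(group) for c in zip(*group)])
            group = [f]
    merged.append([sum(c) / len(group) for c in zip(*group)])
    return merged

# A "vowel" held for 4 frames followed by one transient frame:
frames = [[1.0, 0.0]] * 4 + [[0.0, 1.0]]
print(len(merge_redundant_frames(frames)))  # 5 frames -> 2 tokens
```

Under such a scheme the token rate tracks phonetic dynamics: stationary stretches cost one token each, while transients keep their full resolution, which is the rate-distortion trade-off the paper exploits.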
Problem

Research questions and friction points this paper is trying to address.

Compress temporal redundancy in neural speech codecs
Enable dynamic frame rate for efficient speech encoding
Improve reconstruction quality while reducing bitrate
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dynamic frame rate for neural speech codecs
Unsupervised, architecture-agnostic plugin-style method
ScheDFR and Melt-and-Cool for inference and training
Hankun Wang
Shanghai Jiao Tong University
Speech Synthesis
Yiwei Guo
X-LANCE Lab, School of Computer Science; MoE Key Lab of Artificial Intelligence, AI Institute, Shanghai Jiao Tong University, Shanghai, China
Chongtian Shao
Shanghai Jiao Tong University
natural language processing, speech processing, computational linguistics
Bohan Li
X-LANCE Lab, School of Computer Science; MoE Key Lab of Artificial Intelligence, AI Institute, Shanghai Jiao Tong University, Shanghai, China
Xie Chen
X-LANCE Lab, School of Computer Science; MoE Key Lab of Artificial Intelligence, AI Institute, Shanghai Jiao Tong University, Shanghai, China
Kai Yu
X-LANCE Lab, School of Computer Science; MoE Key Lab of Artificial Intelligence, AI Institute, Shanghai Jiao Tong University, Shanghai, China; MoE Key Lab of Artificial Intelligence, Jiangsu Key Lab of Language Computing