🤖 AI Summary
Residual vector quantization (RVQ) in neural speech codecs suffers from training instability and inefficient residual decomposition, limiting reconstruction quality and robustness. To address this, we propose PURE Codec (Progressive Unfolding of Residual Entropy), a framework that uses a pre-trained speech enhancement model to guide multi-stage residual quantization: the first stage quantizes low-entropy, denoised components to secure foundational fidelity, and later stages progressively model the remaining high-entropy detail. This entropy-aware hierarchy substantially improves training stability and rate-distortion performance. Experiments demonstrate that PURE consistently outperforms standard RVQ under both clean and noisy conditions, achieving significant gains in objective speech reconstruction metrics (e.g., MCD, PESQ) and downstream speech synthesis tasks, with particular benefits for noise robustness.
📝 Abstract
Neural speech codecs have achieved strong performance in low-bitrate compression, but residual vector quantization (RVQ) often suffers from unstable training and ineffective residual decomposition, limiting reconstruction quality and efficiency. We propose PURE Codec (Progressive Unfolding of Residual Entropy), a novel framework that guides multi-stage quantization using a pre-trained speech enhancement model. The first quantization stage reconstructs low-entropy, denoised speech embeddings, while subsequent stages encode the residual high-entropy components. This design significantly improves training stability. Experiments demonstrate that PURE consistently outperforms conventional RVQ-based codecs in speech reconstruction and in downstream text-to-speech based on speech language models, particularly under noisy training conditions.
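To make the progressive scheme concrete, the sketch below gives one plausible reading of the idea in PyTorch. It is not the released implementation: the class names (`VectorQuantizer`, `EnhancementGuidedRVQ`), the codebook size, the number of stages, and the assumption that the low-entropy target is an encoder embedding of the enhanced (denoised) signal are all illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class VectorQuantizer(nn.Module):
    """Single codebook with a straight-through estimator (standard VQ-VAE style)."""

    def __init__(self, codebook_size: int, dim: int):
        super().__init__()
        self.codebook = nn.Embedding(codebook_size, dim)

    def forward(self, x):  # x: (batch, time, dim)
        w = self.codebook.weight                                   # (K, dim)
        # Squared Euclidean distance from each frame to every codeword.
        dist = (x.pow(2).sum(-1, keepdim=True)
                - 2 * x @ w.t()
                + w.pow(2).sum(-1))                                # (batch, time, K)
        idx = dist.argmin(dim=-1)                                  # (batch, time)
        q = self.codebook(idx)                                     # (batch, time, dim)
        commit_loss = F.mse_loss(x, q.detach()) + F.mse_loss(x.detach(), q)
        q = x + (q - x).detach()                                   # straight-through gradient
        return q, commit_loss


class EnhancementGuidedRVQ(nn.Module):
    """Hypothetical sketch: stage 1 quantizes the low-entropy (denoised) embedding
    supplied by a frozen speech enhancement model; the remaining stages quantize
    the residual between the full embedding and what has been quantized so far."""

    def __init__(self, dim: int = 256, codebook_size: int = 1024, n_stages: int = 4):
        super().__init__()
        self.quantizers = nn.ModuleList(
            [VectorQuantizer(codebook_size, dim) for _ in range(n_stages)]
        )

    def forward(self, full_emb, denoised_emb):
        # Stage 1: reconstruct the denoised (low-entropy) embedding first.
        quantized, loss = self.quantizers[0](denoised_emb)
        # Later stages: progressively absorb the high-entropy residual.
        residual = full_emb - quantized
        for vq in self.quantizers[1:]:
            q, l = vq(residual)
            quantized = quantized + q
            residual = residual - q
            loss = loss + l
        return quantized, loss


# Usage sketch: embeddings would come from the codec encoder applied to the
# noisy input and to its enhanced (denoised) version, respectively.
enc_noisy = torch.randn(2, 100, 256)
enc_clean = torch.randn(2, 100, 256)
rvq = EnhancementGuidedRVQ()
recon_emb, vq_loss = rvq(enc_noisy, enc_clean)
```

The only departure from plain RVQ in this sketch is the first stage: its target is the low-entropy, denoised embedding rather than the full embedding, leaving the later codebooks to model the high-entropy remainder.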