U-Codec: Ultra Low Frame-rate Neural Speech Codec for Fast High-fidelity Speech Generation

📅 2025-10-19
📈 Citations: 0
Influential: 0
🤖 AI Summary
Problem: Ultra-low frame-rate (5 Hz) speech coding suffers from degraded intelligibility and spectral distortion. Method: We propose U-Codec, the first high-fidelity neural speech codec to support a 5 Hz frame rate. It combines a Transformer that models long-range inter-frame dependencies, a hierarchical discrete encoding architecture, and a novel global-local collaborative modeling mechanism. We also extend the autoregressive large language model-based text-to-speech (LLM-TTS) framework to model 32-layer residual vector quantization (RVQ) tokens at 5 Hz, the first such adaptation. Results: Experiments show that U-Codec preserves speech naturalness and speaker similarity while achieving roughly 3× faster inference than higher-frame-rate codecs. It is the first work to systematically validate that high-quality speech synthesis can be driven solely by 5 Hz discrete tokens, which opens a path toward ultra-low-bandwidth speech generation.

📝 Abstract
We propose U-Codec, an Ultra low frame-rate neural speech Codec that achieves high-fidelity reconstruction and fast speech generation at an extremely low frame-rate of 5Hz (5 frames per second). Because extreme compression at 5Hz typically causes severe loss of intelligibility and spectral detail, we introduce a Transformer-based inter-frame long-term dependency module and systematically explore residual vector quantization (RVQ) depth and codebook size to identify optimal configurations. Moreover, we apply U-Codec to a large language model (LLM)-based auto-regressive TTS model, which uses a global and local hierarchical architecture to capture dependencies across multi-layer tokens. We extend LLM-based TTS from 3-layer RVQ at 50Hz to 32-layer RVQ at 5Hz. Experimental results demonstrate that U-Codec improves LLM-based TTS inference speed by around 3× over high-frame-rate codecs while maintaining similarity and naturalness. These results validate the feasibility of using highly compressed 5Hz discrete tokens for fast and high-fidelity speech synthesis.
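The trade-off between frame rate and RVQ depth stated in the abstract can be checked with simple arithmetic. The sketch below is only an illustration: `CodecConfig` and the 10-second utterance length are hypothetical, and the only figures taken from the abstract are the two (frame rate, RVQ depth) pairs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CodecConfig:
    """Hypothetical container for the two numbers the abstract reports."""
    frame_rate_hz: int  # frames per second
    rvq_layers: int     # discrete tokens emitted per frame

    def frames(self, seconds: float) -> int:
        # Number of frames needed to cover the utterance.
        return round(self.frame_rate_hz * seconds)

    def tokens(self, seconds: float) -> int:
        # Each frame carries one token per RVQ layer.
        return self.frames(seconds) * self.rvq_layers


baseline = CodecConfig(frame_rate_hz=50, rvq_layers=3)   # 3-layer RVQ at 50Hz
u_codec = CodecConfig(frame_rate_hz=5, rvq_layers=32)    # 32-layer RVQ at 5Hz

duration_s = 10.0  # assumed utterance length, for illustration only
for name, cfg in [("50Hz x 3 layers", baseline), ("5Hz x 32 layers", u_codec)]:
    print(f"{name}: {cfg.frames(duration_s)} frames, {cfg.tokens(duration_s)} tokens")
```

Under these assumptions the two settings produce a similar total number of tokens (1500 versus 1600 for 10 seconds), while the 5Hz setting has ten times fewer frames. The reported speedup of around 3× is an experimental result from the abstract, not something this arithmetic derives.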
Problem

Research questions and friction points this paper is trying to address.

Achieving high-fidelity speech reconstruction at ultra low frame-rates
Addressing intelligibility loss from extreme compression at 5Hz
Enabling fast speech generation while maintaining quality in TTS
Innovation

Methods, ideas, or system contributions that make the work stand out.

Ultra low frame-rate neural speech codec at 5Hz
Transformer-based inter-frame long-term dependency module
Residual vector quantization depth and codebook optimization
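For readers unfamiliar with residual vector quantization, here is a minimal, generic sketch of the technique, not the U-Codec implementation. The codebooks are random rather than learned, and the codebook size (16), vector dimension (8), and per-layer scale decay are all assumptions made for illustration. A zero entry is added to every codebook so that, in this toy, reconstruction error cannot grow as layers are added.

```python
import random


def nearest(codebook, vec):
    """Index of the codebook entry closest to vec (squared Euclidean distance)."""
    return min(range(len(codebook)),
               key=lambda i: sum((c - v) ** 2 for c, v in zip(codebook[i], vec)))


def rvq_encode(vec, codebooks):
    """Greedy RVQ: each layer quantizes what the previous layers left over."""
    residual = list(vec)
    indices = []
    for codebook in codebooks:
        idx = nearest(codebook, residual)
        indices.append(idx)
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return indices


def rvq_decode(indices, codebooks):
    """Reconstruction is the sum of the selected entries across layers."""
    out = [0.0] * len(codebooks[0][0])
    for idx, codebook in zip(indices, codebooks):
        out = [o + c for o, c in zip(out, codebook[idx])]
    return out


def make_codebooks(layers, size, dim, seed=0):
    """Random toy codebooks (assumption: real codecs learn these from data)."""
    rng = random.Random(seed)
    books = []
    for layer in range(layers):
        scale = 0.8 ** layer  # later layers model finer residuals
        entries = [[rng.gauss(0, scale) for _ in range(dim)] for _ in range(size - 1)]
        entries.append([0.0] * dim)  # zero entry: a layer may leave the residual as is
        books.append(entries)
    return books


def sq_error(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


frame = [random.Random(1).gauss(0, 1) for _ in range(8)]  # one toy latent frame
codebooks = make_codebooks(layers=32, size=16, dim=8)
codes = rvq_encode(frame, codebooks)  # 32 token indices for this single frame

errors = {}
for depth in (1, 4, 32):
    approx = rvq_decode(codes[:depth], codebooks[:depth])
    errors[depth] = sq_error(frame, approx)
    print(f"depth={depth:2d} squared error={errors[depth]:.4f}")
```

Each frame becomes one index per layer, which is why a deeper RVQ can carry more detail per frame and offset a lower frame rate. How depth and codebook size should be balanced is the configuration question the paper explores.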
Xusheng Yang
Peking University
Long Zhou
Tencent Hunyuan
Wenfu Wang
Tencent AI Lab
Kai Hu
Tencent Hunyuan
Shulin Feng
Tencent Hunyuan
Chenxing Li
Tencent AI Lab
Meng Yu
Tencent AI Lab
Dong Yu
Tencent AI Lab
Yuexian Zou
Peking University Shenzhen Graduate School
Machine Learning, Speech Processing, Image Processing