U-Codec: Ultra Low Frame-rate Neural Speech Codec for Fast High-fidelity Speech Generation

📅 2025-10-19

🤖 AI Summary

Problem: Ultra-low frame-rate (5 Hz) speech coding suffers from degraded intelligibility and spectral distortion. Method: We propose U-Codec, the first high-fidelity neural speech codec supporting a 5 Hz frame rate. It integrates Transformers to model long-range inter-frame dependencies, a hierarchical discrete encoding architecture, and a novel global-local collaborative modeling mechanism. Crucially, we extend the autoregressive large language model-based text-to-speech (LLM-TTS) framework to model 32-layer residual vector quantization (RVQ) tokens at 5 Hz, the first such adaptation. Results: Experiments demonstrate that U-Codec preserves speech naturalness and speaker similarity while achieving ~3× faster inference than higher-frame-rate codecs. It is the first work to systematically validate the feasibility of high-quality speech synthesis driven solely by 5 Hz discrete tokens, establishing a new paradigm for ultra-low-bandwidth speech generation.

📝 Abstract

We propose \textbf{U-Codec}, an \textbf{U}ltra low frame-rate neural speech \textbf{Codec} that achieves high-fidelity reconstruction and fast speech generation at an extremely low frame-rate of 5Hz (5 frames per second). Because extreme compression at 5Hz typically leads to severe loss of intelligibility and spectral detail, we introduce a Transformer-based inter-frame long-term dependency module and systematically explore residual vector quantization (RVQ) depth and codebook size to identify optimal configurations. Moreover, we apply U-Codec to a large language model (LLM)-based auto-regressive TTS model, which leverages a global and local hierarchical architecture to effectively capture dependencies across multi-layer tokens. We extend LLM-based TTS from 3-layer RVQ at 50Hz to 32-layer RVQ at 5Hz. Experimental results demonstrate that U-Codec improves LLM-based TTS inference speed by around 3$\times$ over high-frame-rate codecs while maintaining similarity and naturalness. These results validate the feasibility of using highly compressed 5Hz discrete tokens for fast and high-fidelity speech synthesis.
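
The two configurations named in the abstract (3-layer RVQ at 50Hz and 32-layer RVQ at 5Hz) can be compared with a little arithmetic. The sketch below counts frames and discrete tokens for a clip of a given length. The helper name `token_budget` and the 10-second clip are illustrative assumptions, not taken from the paper.

```python
def token_budget(frame_rate_hz: float, rvq_layers: int, seconds: float = 1.0):
    """Return (frames, total_tokens) for `seconds` of audio.

    Each frame carries one token per RVQ layer, so the total token
    count is frames * rvq_layers.
    """
    frames = int(round(frame_rate_hz * seconds))
    return frames, frames * rvq_layers


# Hypothetical 10-second utterance, chosen only for illustration.
baseline = token_budget(50, 3, seconds=10)   # 3-layer RVQ at 50Hz
low_rate = token_budget(5, 32, seconds=10)   # 32-layer RVQ at 5Hz

print(baseline)  # (500, 1500)
print(low_rate)  # (50, 1600)
```

The total token count is similar in both cases, but the 5Hz configuration has ten times fewer frames. One plausible reading, which the abstract does not state explicitly, is that a model stepping once per frame and predicting that frame's layers locally benefits from the shorter frame sequence.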

Problem

Research questions and friction points this paper is trying to address.

Achieving high-fidelity speech reconstruction at ultra low frame-rates

Addressing intelligibility loss from extreme compression at 5Hz

Enabling fast speech generation while maintaining quality in TTS

Innovation

Methods, ideas, or system contributions that make the work stand out.

Ultra low frame-rate neural speech codec at 5Hz

Transformer-based inter-frame long-term dependency module

Residual vector quantization depth and codebook optimization
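
For readers unfamiliar with residual vector quantization, here is a minimal sketch of the general technique in plain Python. It is not U-Codec's implementation: the codebooks are tiny and hand-written for illustration (real codecs learn them and quantize encoder latents), and the function names are hypothetical. "Depth" is the number of codebooks and "codebook size" is the number of entries in each.

```python
def nearest(codebook, vec):
    """Index of the codeword closest to vec (squared Euclidean distance)."""
    return min(
        range(len(codebook)),
        key=lambda i: sum((c - v) ** 2 for c, v in zip(codebook[i], vec)),
    )


def rvq_encode(vec, codebooks):
    """Quantize vec layer by layer; each layer encodes the previous residual."""
    residual = list(vec)
    indices = []
    for codebook in codebooks:
        idx = nearest(codebook, residual)
        indices.append(idx)
        # Subtract the chosen codeword so the next layer sees what is left.
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return indices


def rvq_decode(indices, codebooks):
    """Reconstruct by summing the selected codeword from every layer."""
    out = [0.0] * len(codebooks[0][0])
    for idx, codebook in zip(indices, codebooks):
        out = [o + c for o, c in zip(out, codebook[idx])]
    return out


# Toy setup (assumed for illustration): depth 3, codebook size 4, dimension 2.
codebooks = [
    [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],      # coarse layer
    [[0.0, 0.0], [0.5, 0.0], [0.0, 0.5], [0.5, 0.5]],      # finer layer
    [[0.0, 0.0], [0.25, 0.0], [0.0, 0.25], [0.25, 0.25]],  # finest layer
]
frame = [0.8, 0.3]

codes = rvq_encode(frame, codebooks)
recon = rvq_decode(codes, codebooks)
print(codes)  # [1, 2, 0]
print(recon)  # [1.0, 0.5]
```

Each frame is represented by one index per layer, so a 32-layer configuration emits 32 tokens per frame. The trade-off between depth and codebook size is what the listed contribution explores.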
