🤖 AI Summary
Ultra-low frame-rate (5 Hz) speech coding suffers from degraded intelligibility and spectral distortion. Method: We propose U-Codec, a high-fidelity neural speech codec that operates at a 5 Hz frame rate. It uses a Transformer-based module to model long-range inter-frame dependencies and systematically explores residual vector quantization (RVQ) depth and codebook size. U-Codec is then applied to an autoregressive large language model-based text-to-speech (LLM-TTS) framework whose global-local hierarchical architecture captures dependencies across multi-layer tokens, extending LLM-based TTS from 3-layer RVQ at 50 Hz to 32-layer RVQ at 5 Hz. Results: Experiments show that U-Codec makes LLM-based TTS inference about 3× faster than with high-frame-rate codecs while preserving speech naturalness and speaker similarity. The results validate the feasibility of high-quality speech synthesis driven by highly compressed 5 Hz discrete tokens.
📝 Abstract
We propose \textbf{U-Codec}, an \textbf{U}ltra low frame-rate neural speech \textbf{Codec} that achieves high-fidelity reconstruction and fast speech generation at an extremely low frame rate of 5Hz (5 frames per second). Extreme compression at 5Hz typically leads to severe loss of intelligibility and spectral detail. To address this, we introduce a Transformer-based inter-frame long-term dependency module and systematically explore residual vector quantization (RVQ) depth and codebook size to identify optimal configurations. Moreover, we apply U-Codec to a large language model (LLM)-based auto-regressive TTS model, which leverages a global and local hierarchical architecture to effectively capture dependencies across multi-layer tokens. We extend LLM-based TTS from 3-layer RVQ at 50Hz to 32-layer RVQ at 5Hz. Experimental results demonstrate that U-Codec improves LLM-based TTS inference speed by around 3$\times$ over high-frame-rate codecs while maintaining similarity and naturalness. These results validate the feasibility of using highly compressed 5Hz discrete tokens for fast and high-fidelity speech synthesis.
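To make the two configurations named in the abstract concrete, the following minimal sketch compares frame and token counts for 3-layer RVQ at 50Hz against 32-layer RVQ at 5Hz. The `CodecConfig` class and its methods are hypothetical helpers written for this illustration, and the codebook size of 1024 is an assumption (the abstract does not state one), so the bitrate figures are illustrative only.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class CodecConfig:
    """Hypothetical description of an RVQ codec setup (not from the paper's code)."""
    frame_rate_hz: int   # frames emitted per second of audio
    rvq_layers: int      # RVQ codebooks (tokens) per frame
    codebook_size: int   # ASSUMPTION: entries per codebook, not given in the abstract

    def frames(self, seconds: float) -> int:
        # Number of frames, i.e. sequential steps along the time axis.
        return math.ceil(self.frame_rate_hz * seconds)

    def tokens(self, seconds: float) -> int:
        # Total discrete tokens: one token per RVQ layer per frame.
        return self.frames(seconds) * self.rvq_layers

    def bitrate_bps(self) -> float:
        # Each token carries log2(codebook_size) bits.
        return self.frame_rate_hz * self.rvq_layers * math.log2(self.codebook_size)


# The two settings mentioned in the abstract; codebook size 1024 is assumed.
baseline = CodecConfig(frame_rate_hz=50, rvq_layers=3, codebook_size=1024)
ultra_low = CodecConfig(frame_rate_hz=5, rvq_layers=32, codebook_size=1024)

seconds = 10.0
for name, cfg in [("50Hz x 3 layers", baseline), ("5Hz x 32 layers", ultra_low)]:
    print(name, cfg.frames(seconds), "frames,", cfg.tokens(seconds), "tokens,",
          cfg.bitrate_bps(), "bps")
```

Under these assumptions, the 5Hz setting has ten times fewer frames for the same audio while the total token count stays roughly the same (1600 versus 1500 for 10 seconds). The shorter frame sequence is what the global and local hierarchy operates over. The reported speedup of around 3$\times$ is an empirical result and is not derived from this arithmetic.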