🤖 AI Summary
Existing audio codecs operate at excessively high frame rates (>50 FPS), severely hindering the training and inference efficiency of large speech models. To address this, we propose an ultra-low-frame-rate (12.5 FPS) efficient audio codec. Our method integrates non-autoregressive sequence modeling, high-fidelity acoustic feature compression, low-frame-rate temporal encoding, and a jointly optimized quantization–reconstruction network. Through systematic ablation studies, we co-optimize frame rate, bitrate, and causal design, achieving high-quality speech reconstruction for the first time at such extreme temporal sparsity. Evaluated across multiple bitrate settings, our codec significantly outperforms state-of-the-art baselines on both objective and perceptual metrics. It preserves speech naturalness and intelligibility while reducing inference latency by a factor of 3–5. This work establishes a new paradigm for deploying real-time large speech models.
📝 Abstract
Large Language Models (LLMs) have significantly advanced audio processing by leveraging audio codecs to discretize audio into tokens, enabling the application of language modeling techniques to speech data. However, existing audio codecs often operate at high frame rates, leading to slow training and inference, particularly for autoregressive models. To address this, there is growing interest in low frame-rate audio codecs, which reduce the number of autoregressive steps required to generate one second of audio. In this paper, we conduct ablation studies to examine the impact of frame rate, bitrate, and causality on codec reconstruction quality. Based on our findings, we introduce NanoCodec, a state-of-the-art audio codec that achieves high-quality compression at just 12.5 frames per second (FPS). NanoCodec outperforms related works across various bitrate ranges, establishing a new benchmark for low-latency and efficient Speech LLM training and inference.
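To make the frame-rate/bitrate trade-off concrete, the back-of-the-envelope arithmetic can be sketched as below. This is an illustrative calculation only, not NanoCodec's actual configuration: the codebook counts and sizes used here are hypothetical round numbers chosen for the example.

```python
import math

def codec_bitrate_bps(fps: float, num_codebooks: int, codebook_size: int) -> float:
    """Bitrate of a residual-VQ-style codec: frames/sec x codebooks x bits per code."""
    bits_per_code = math.log2(codebook_size)
    return fps * num_codebooks * bits_per_code

def ar_steps_per_second(fps: float, num_codebooks: int, flattened: bool = True) -> float:
    """Autoregressive steps needed to generate one second of audio.

    If the LLM flattens all codebooks into one token stream, each frame costs
    num_codebooks steps; if it predicts all codebooks of a frame in parallel,
    each frame costs one step.
    """
    return fps * num_codebooks if flattened else fps

# Hypothetical high-frame-rate codec: 50 FPS, 8 codebooks of 1024 entries.
print(codec_bitrate_bps(50, 8, 1024))      # 4000.0 bps
print(ar_steps_per_second(50, 8))          # 400.0 steps/sec of audio

# Hypothetical low-frame-rate codec: 12.5 FPS, 4 codebooks of 1024 entries.
print(codec_bitrate_bps(12.5, 4, 1024))    # 500.0 bps
print(ar_steps_per_second(12.5, 4))        # 50.0 steps/sec of audio
```

Under these assumed settings, dropping from 50 FPS to 12.5 FPS cuts the autoregressive steps per second of generated audio by 8x, which is the efficiency motivation the abstract describes.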