🤖 AI Summary
End-to-end speech large language models (Speech LLMs) demand ultra-low-bitrate, low-latency audio codecs for efficient real-time deployment.
Method: We propose an industrial-grade neural audio tokenizer/detokenizer architecture that decouples semantic and acoustic feature modeling, employs multi-stage collaborative training, and integrates frame-wise compression, differentiable quantization, and streaming decoding—achieving ultra-low bitrates of 0.43–0.87 kbps at a frame rate of 16.67 Hz.
Contribution/Results: Our method significantly reduces bandwidth requirements (<1 kbps) while preserving high speech intelligibility, naturalness, semantic expressiveness, and waveform reconstruction fidelity—outperforming existing approaches in the ultra-low-bitrate regime. The architecture is optimized for real-time, streaming interaction with Speech LLMs. Code and pre-trained models are publicly released to facilitate lightweight deployment and accelerate research in low-latency speech-language modeling.
📝 Abstract
This paper presents LongCat-Audio-Codec, an audio tokenizer and detokenizer solution designed for industrial-grade end-to-end speech large language models. By leveraging a decoupled model architecture and a multi-stage training strategy, LongCat-Audio-Codec exhibits robust semantic modeling, flexible acoustic feature extraction, and low-latency streaming synthesis. It encodes speech at an ultra-low frame rate of 16.67 Hz, with a minimum bitrate of 0.43 kbps and a maximum bitrate of 0.87 kbps. Evaluation results demonstrate that LongCat-Audio-Codec achieves strong speech intelligibility and is capable of synthesizing high-quality speech at low bitrate, thus effectively balancing coding efficiency and decoding quality. The inference code and model checkpoints of LongCat-Audio-Codec are available at: https://github.com/meituan-longcat/LongCat-Audio-Codec.
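The stated bitrates follow directly from the frame rate and the number of bits spent per frame (bitrate = frame rate × bits per frame). A minimal sketch of that arithmetic, assuming per-frame bit budgets of 26 and 52 bits that reproduce the paper's 0.43 and 0.87 kbps figures (the actual codebook split is not specified here):

```python
# Sketch: relate frame rate and per-frame bit budget to bitrate.
# The 26/52 bits-per-frame values are assumptions chosen to match the
# paper's reported 0.43-0.87 kbps range, not figures from the paper.
FRAME_RATE_HZ = 16.67

def bitrate_kbps(bits_per_frame: int) -> float:
    """Bitrate in kbps for a codec emitting `bits_per_frame` at FRAME_RATE_HZ."""
    return FRAME_RATE_HZ * bits_per_frame / 1000.0

print(round(bitrate_kbps(26), 2))  # ~0.43 kbps (minimum-bitrate configuration)
print(round(bitrate_kbps(52), 2))  # ~0.87 kbps (maximum-bitrate configuration)
```

At 16.67 Hz, doubling the per-frame token budget doubles the bitrate, which is consistent with the roughly 2× span between the minimum and maximum bitrates reported above.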