🤖 AI Summary
End-to-end speech large language models (Speech LLMs) demand ultra-low-bitrate, low-latency audio codecs for efficient real-time deployment.
Method: We propose an industrial-grade neural audio tokenizer/detokenizer architecture that decouples semantic and acoustic feature modeling, employs multi-stage collaborative training, and integrates frame-wise compression, differentiable quantization, and streaming decoding—achieving ultra-low bitrates of 0.43–0.87 kbps at a frame rate of 16.67 Hz.
Contribution/Results: Our method significantly reduces bandwidth requirements (<1 kbps) while preserving high speech intelligibility, naturalness, semantic expressiveness, and waveform reconstruction fidelity—outperforming existing approaches in the ultra-low-bitrate regime. The architecture is optimized for real-time, streaming interaction with Speech LLMs. Code and pre-trained models are publicly released to facilitate lightweight deployment and accelerate research in low-latency speech-language modeling.
📝 Abstract
This paper presents LongCat-Audio-Codec, an audio tokenizer and detokenizer solution designed for industrial-grade end-to-end speech large language models. By leveraging a decoupled model architecture and a multi-stage training strategy, LongCat-Audio-Codec exhibits robust semantic modeling, flexible acoustic feature extraction, and low-latency streaming synthesis. It encodes speech at an ultra-low frame rate of 16.67 Hz, with a minimum bitrate of 0.43 kbps and a maximum bitrate of 0.87 kbps. Evaluation results demonstrate that LongCat-Audio-Codec achieves strong speech intelligibility and is capable of synthesizing high-quality speech at low bitrate, thus effectively balancing coding efficiency and decoding quality. The inference code and model checkpoints of LongCat-Audio-Codec are available at: https://github.com/meituan-longcat/LongCat-Audio-Codec.
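The stated bitrates follow directly from the frame rate and the number of bits spent per frame (bitrate = frame rate × bits per frame). A minimal sketch of that arithmetic, assuming per-frame bit budgets of 26 and 52 bits that reproduce the paper's 0.43 and 0.87 kbps figures (the actual codebook split is not specified here):

```python
# Sketch: relate frame rate and per-frame bit budget to bitrate.
# The 26/52 bits-per-frame values are assumptions chosen to match the
# paper's reported 0.43-0.87 kbps range, not figures from the paper.
FRAME_RATE_HZ = 16.67

def bitrate_kbps(bits_per_frame: int) -> float:
    """Bitrate in kbps for a codec emitting `bits_per_frame` at FRAME_RATE_HZ."""
    return FRAME_RATE_HZ * bits_per_frame / 1000.0

print(round(bitrate_kbps(26), 2))  # ~0.43 kbps (minimum-bitrate configuration)
print(round(bitrate_kbps(52), 2))  # ~0.87 kbps (maximum-bitrate configuration)
```

At 16.67 Hz, doubling the per-frame token budget doubles the bitrate, which is consistent with the roughly 2× span between the minimum and maximum bitrates reported above.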