LongCat-Audio-Codec: An Audio Tokenizer and Detokenizer Solution Designed for Speech Large Language Models

📅 2025-10-16
📈 Citations: 0
Influential: 0
🤖 AI Summary
End-to-end speech large language models (Speech LLMs) need ultra-low-bitrate, low-latency audio codecs for efficient real-time deployment. Method: We propose an industrial-grade neural audio tokenizer/detokenizer architecture that decouples semantic and acoustic feature modeling, employs multi-stage collaborative training, and integrates frame-wise compression, differentiable quantization, and streaming decoding, achieving bitrates of 0.43–0.87 kbps at a frame rate of 16.67 Hz. Contribution/Results: The method keeps the bitrate below 1 kbps while preserving speech intelligibility, naturalness, semantic expressiveness, and waveform reconstruction fidelity, outperforming existing approaches in the ultra-low-bitrate regime. The architecture is optimized for real-time, streaming interaction with Speech LLMs. Inference code and model checkpoints are publicly released to facilitate lightweight deployment and accelerate research in low-latency speech-language modeling.

📝 Abstract
This paper presents LongCat-Audio-Codec, an audio tokenizer and detokenizer solution designed for industrial-grade end-to-end speech large language models. By leveraging a decoupled model architecture and a multistage training strategy, LongCat-Audio-Codec provides robust semantic modeling, flexible acoustic feature extraction, and low-latency streaming synthesis. It encodes speech at an ultra-low frame rate of 16.67 Hz, with a minimum bitrate of 0.43 kbps and a maximum bitrate of 0.87 kbps. Evaluation results demonstrate that LongCat-Audio-Codec achieves strong speech intelligibility and can synthesize high-quality speech at low bitrates, thus effectively balancing coding efficiency and decoding quality. The inference code and model checkpoints of LongCat-Audio-Codec are available at: https://github.com/meituan-longcat/LongCat-Audio-Codec.
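To make the frame-rate and bitrate figures above concrete, the following minimal sketch works through the arithmetic. Only the 16.67 Hz frame rate and the 0.43 to 0.87 kbps range come from the abstract. The helper names are hypothetical, and the codebook size (8192 entries, i.e. 13 bits) and codebook counts (2 and 4) are assumptions chosen because they happen to reproduce the stated bitrates; the text here does not confirm the actual quantizer configuration.

```python
import math

FRAME_RATE_HZ = 16.67             # stated in the abstract
MIN_KBPS, MAX_KBPS = 0.43, 0.87   # stated in the abstract


def frame_duration_ms(frame_rate_hz: float) -> float:
    """Milliseconds of audio covered by one token frame."""
    return 1000.0 / frame_rate_hz


def bits_per_frame(bitrate_kbps: float, frame_rate_hz: float) -> float:
    """Bit budget available to each frame at a given bitrate."""
    return bitrate_kbps * 1000.0 / frame_rate_hz


def bitrate_kbps(frame_rate_hz: float, n_codebooks: int, codebook_size: int) -> float:
    """Bitrate when every frame emits one index from each codebook.

    ASSUMPTION: codebook count and size are illustrative, not taken
    from the source text.
    """
    return frame_rate_hz * n_codebooks * math.log2(codebook_size) / 1000.0


if __name__ == "__main__":
    print(f"frame duration: {frame_duration_ms(FRAME_RATE_HZ):.1f} ms")
    print(f"bits/frame at min bitrate: {bits_per_frame(MIN_KBPS, FRAME_RATE_HZ):.1f}")
    print(f"bits/frame at max bitrate: {bits_per_frame(MAX_KBPS, FRAME_RATE_HZ):.1f}")
    # Hypothetical configurations: 2 or 4 codebooks of 8192 entries each.
    for n in (2, 4):
        print(f"{n} codebooks x 13 bits: {bitrate_kbps(FRAME_RATE_HZ, n, 8192):.2f} kbps")
```

Each frame covers roughly 60 ms of audio, and the stated bitrate range corresponds to a budget of roughly 26 to 52 bits per frame, which is why a speech LLM consuming these tokens sees far fewer steps per second than with higher-frame-rate codecs.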
Problem

Research questions and friction points this paper is trying to address.

Develops ultra-low frame rate audio tokenizer for speech LLMs
Enables low-bitrate speech synthesis with high intelligibility
Balances coding efficiency and decoding quality in streaming
Innovation

Methods, ideas, or system contributions that make the work stand out.

Decoupled model architecture for audio tokenization
Multistage training strategy for robust semantic modeling
Ultra-low frame rate encoding at 16.67 Hz
Authors

Xiaohan Zhao
Mohamed bin Zayed University of Artificial Intelligence
efficient deep learning, adversarial attack
Hongyu Xiang
LongCat Team, Meituan
Shengze Ye
LongCat Team, Meituan
Song Li
LongCat Team, Meituan
Zhengkun Tian
LongCat Team, Meituan
Guanyu Chen
LongCat Team, Meituan
Ke Ding
LongCat Team, Meituan
Guanglu Wan
LongCat Team, Meituan