MOSS-Audio-Tokenizer: Scaling Audio Tokenizers for Future Audio Foundation Models

📅 2026-02-11
📈 Citations: 0
✨ Influential: 0
📄 PDF
🤖 AI Summary
Existing discrete audio tokenizers rely on pretrained encoders, semantic distillation, or heterogeneous CNN architectures, introducing fixed inductive biases that constrain reconstruction fidelity and scalability. This work proposes CAT, a fully end-to-end, homogeneous, and scalable audio tokenization method that, for the first time, employs a purely causal Transformer to jointly optimize the encoder, quantizer, and decoder without auxiliary modules. Leveraging a 1.6-billion-parameter model pretrained on 3 million hours of diverse audio, CAT consistently outperforms existing codecs across speech, sound, and music at various bitrates. The resulting discrete tokens enable the first purely autoregressive text-to-speech system that surpasses non-autoregressive counterparts, and they demonstrate competitive performance in automatic speech recognition tasks.

๐Ÿ“ Abstract
Discrete audio tokenizers are fundamental to empowering large language models with native audio processing and generation capabilities. Despite recent progress, existing approaches often rely on pretrained encoders, semantic distillation, or heterogeneous CNN-based architectures. These designs introduce fixed inductive biases that limit reconstruction fidelity and hinder effective scaling. In this paper, we argue that discrete audio tokenization should be learned fully end-to-end using a homogeneous and scalable architecture. To this end, we first propose CAT (Causal Audio Tokenizer with Transformer), a purely Transformer-based architecture that jointly optimizes the encoder, quantizer, and decoder from scratch for high-fidelity reconstruction. Building on the CAT architecture, we develop MOSS-Audio-Tokenizer, a large-scale audio tokenizer featuring 1.6 billion parameters, pre-trained on 3 million hours of diverse, general audio data. We show that this simple, fully end-to-end approach built from homogeneous, causal Transformer blocks scales gracefully and supports high-fidelity reconstruction across diverse audio domains. Across speech, sound, and music, MOSS-Audio-Tokenizer consistently outperforms prior codecs over a wide range of bitrates, while exhibiting predictable improvements with increased scale. Notably, leveraging the discrete tokens from our model, we develop the first purely autoregressive TTS model that surpasses prior non-autoregressive and cascaded systems. Furthermore, MOSS-Audio-Tokenizer enables competitive ASR performance without auxiliary encoders. Our findings position the CAT architecture as a unified, scalable interface for the next generation of native audio foundation models.
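The abstract's central component is the quantizer, which turns continuous encoder outputs into discrete tokens. As a rough illustration of that idea only (a toy nearest-neighbor vector quantizer in plain Python, not the paper's CAT architecture or training procedure, and with a made-up codebook):

```python
# Illustrative sketch: nearest-neighbor vector quantization, the basic
# mechanism by which continuous encoder frames become discrete token ids.
# The codebook and frames below are invented toy values, not from the paper.

def quantize(frames, codebook):
    """Map each continuous frame to the index of its nearest codebook
    vector (squared Euclidean distance).
    Returns (token_ids, reconstructed_frames)."""
    def sqdist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    ids, recon = [], []
    for f in frames:
        i = min(range(len(codebook)), key=lambda k: sqdist(f, codebook[k]))
        ids.append(i)          # discrete token emitted for this frame
        recon.append(codebook[i])  # what a decoder would see
    return ids, recon

codebook = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
frames = [(0.1, -0.2), (0.9, 0.1), (0.4, 0.6)]
tokens, recon = quantize(frames, codebook)
print(tokens)  # → [0, 1, 2]
```

In a real tokenizer such as the one described here, the frames come from a learned causal Transformer encoder, the codebook is trained jointly with the rest of the model, and the token sequence is what a downstream autoregressive TTS or ASR model consumes.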
Problem

Research questions and friction points this paper is trying to address.

audio tokenizer
discrete representation
scalability
inductive bias
audio foundation models
Innovation

Methods, ideas, or system contributions that make the work stand out.

end-to-end audio tokenization
homogeneous Transformer architecture
scalable audio foundation model
causal Transformer
discrete audio representation
🔎 Similar Papers
No similar papers found.
Yitian Gong
Kuangwei Chen
Zhaoye Fei (Fudan University · Natural Language Processing)
Xiaogui Yang
Ke Chen
Yang Wang
Kexin Huang
Mingshu Chen
Ruixiao Li
Qingyuan Cheng
Shimin Li (Fudan University · Large Language Model · Speech Language Model)
Xipeng Qiu