ALMTokenizer: A Low-bitrate and Semantic-rich Audio Codec Tokenizer for Audio Language Modeling

📅 2025-04-14

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing audio codec tokenizers, such as Encodec, typically encode each audio frame independently, which makes it hard to capture cross-frame semantic dependencies and yields semantically impoverished discrete tokens and suboptimal compression efficiency. To address this, the paper proposes ALMTokenizer, a semantic-aware low-bitrate audio codec tokenizer. It introduces learnable query tokens and uses query-based attention to model global context, and it combines a masked autoencoder loss, semantic-prior-guided vector quantization, and an autoregressive prediction loss to improve both semantic content and reconstruction quality. To the best of the authors' knowledge, this is the first query-based global audio compression framework. ALMTokenizer achieves reconstruction quality competitive with state-of-the-art approaches at a lower bitrate and, within the same audio language model framework, outperforms previous tokenizers on audio understanding and generation tasks.

📝 Abstract

Recent advancements in audio language models have underscored the pivotal role of audio tokenization, which converts audio signals into discrete tokens, thereby facilitating the application of language model architectures to the audio domain. In this study, we introduce ALMTokenizer, a novel low-bitrate and semantically rich audio codec tokenizer for audio language models. Prior methods, such as Encodec, typically encode individual audio frames into discrete tokens without using context information across frames. Unlike these methods, we introduce a novel query-based compression strategy that captures holistic information with a set of learnable query tokens by explicitly modeling the context information across frames. This design not only enables the codec model to capture more semantic information but also encodes the audio signal with shorter token sequences. Additionally, to enhance the semantic information in audio codec models, we introduce the following: (1) a masked autoencoder (MAE) loss, (2) vector quantization based on semantic priors, and (3) an autoregressive (AR) prediction loss. As a result, ALMTokenizer achieves competitive reconstruction performance relative to state-of-the-art approaches while operating at a lower bitrate. Within the same audio language model framework, ALMTokenizer outperforms previous tokenizers in audio understanding and generation tasks.
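
The query-based compression idea can be illustrated with a minimal sketch, which is not the paper's architecture. It assumes a single cross-attention step in which a small set of query vectors attends over all frame embeddings, so each output vector summarizes context from the whole sequence. The function name `query_compress`, the shapes, and the random inputs are hypothetical, and a real model would learn the queries and use trained projections.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def query_compress(frames, queries):
    """Summarize T frame embeddings into Q vectors via cross-attention.

    frames:  (T, d) per-frame embeddings
    queries: (Q, d) query tokens (random here; learned in a real model)
    returns: (Q, d) summaries and the (Q, T) attention weights
    """
    d = frames.shape[-1]
    scores = queries @ frames.T / np.sqrt(d)  # (Q, T) scaled dot products
    weights = softmax(scores, axis=-1)        # each query attends over all frames
    return weights @ frames, weights

rng = np.random.default_rng(0)
frames = rng.normal(size=(50, 16))   # hypothetical: 50 frames, 16-dim features
queries = rng.normal(size=(5, 16))   # hypothetical: 5 query tokens
summary, weights = query_compress(frames, queries)
print(summary.shape)  # 50 frames are represented by 5 vectors
```

Because the number of outputs is set by the number of queries rather than the number of frames, the sequence handed to the quantizer can be much shorter than the frame sequence.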

Problem

Research questions and friction points this paper is trying to address.

Develops low-bitrate semantic-rich audio tokenizer for language models

Improves audio compression by capturing cross-frame context information

Enhances semantic encoding via MAE loss and vector quantization
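
To make "low-bitrate" concrete, the bitrate of a discrete tokenizer follows from the token rate, the number of codebooks, and the codebook size. The sketch below uses illustrative settings that are assumptions and are not taken from the paper.

```python
import math

def bitrate_bps(tokens_per_second, codebook_size, num_codebooks=1):
    # Each token drawn from a codebook of size K carries log2(K) bits.
    return tokens_per_second * num_codebooks * math.log2(codebook_size)

# Illustrative settings only (assumed, not reported in the source).
per_frame = bitrate_bps(75, 1024, num_codebooks=8)
compressed = bitrate_bps(12.5, 1024, num_codebooks=3)
print(per_frame, compressed)
```

Lowering the token rate or the number of codebooks reduces the bitrate linearly, which is why a tokenizer that emits fewer tokens per second operates at a lower bitrate.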

Innovation

Methods, ideas, or system contributions that make the work stand out.

Query-based compression captures holistic audio context

Masked autoencoder loss enhances semantic information

Vector quantization leverages semantic priors effectively
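
As a rough illustration of vector quantization against a fixed codebook, the sketch below maps each latent vector to its nearest codeword. How the paper derives its semantic priors is not described here; the random codebook is a placeholder standing in for codewords obtained from some semantic feature space, and `quantize` is a hypothetical helper.

```python
import numpy as np

def quantize(latents, codebook):
    """Nearest-codeword lookup.

    latents:  (N, d) continuous vectors
    codebook: (K, d) codewords
    returns:  (N,) token ids and the (N, d) quantized vectors
    """
    # Squared Euclidean distance between every latent and every codeword.
    d2 = ((latents[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    ids = d2.argmin(axis=1)
    return ids, codebook[ids]

rng = np.random.default_rng(0)
codebook = rng.normal(size=(8, 4))  # placeholder for a semantically derived codebook
# Latents built as slightly perturbed copies of codewords 2, 5, and 5.
latents = codebook[[2, 5, 5]] + 0.01 * rng.normal(size=(3, 4))
ids, quantized = quantize(latents, codebook)
print(ids)
```

The token ids are what a language model would consume, while the quantized vectors are what a decoder would use for reconstruction.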

🔎 Similar Papers

WavTokenizer: an Efficient Acoustic Discrete Codec Tokenizer for Audio Language Modeling