Generative Audio Language Modeling with Continuous-valued Tokens and Masked Next-Token Prediction

📅 2025-07-13
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the mismatch between continuous audio signals and conventional autoregressive modeling over discrete tokens in audio generation. Method: We propose the first causal language-model framework operating directly on continuous-valued tokens, eliminating discrete tokenization by representing audio segments as continuous vectors. A token-level diffusion mechanism models the distribution of each continuous-valued token, and, for the first time in a causal decoding architecture, a masked next-token prediction objective is incorporated to preserve autoregressivity while enhancing contextual modeling. Contribution/Results: Our approach unifies diffusion modeling with the causal language-modeling paradigm while using significantly fewer parameters than state-of-the-art diffusion models. On AudioCaps, our method improves on AudioGen by 20% in Fréchet Audio Distance (FAD) and 40% in KL divergence; the masked variant further lowers FAD by 41% and 33% relative to AudioGen Base and Large, respectively. Generated audio quality matches that of advanced diffusion-based methods.

📝 Abstract
Autoregressive next-token prediction with the Transformer decoder has become a de facto standard in large language models (LLMs), achieving remarkable success in Natural Language Processing (NLP) at scale. Extending this paradigm to audio poses unique challenges due to its inherently continuous nature. We study audio generation with a causal language model (LM) without discrete tokens. We leverage token-wise diffusion to model the continuous distribution of the next continuous-valued token. Our approach delivers significant improvements over the previous discrete-token solution, AudioGen, achieving 20% and 40% relative gains on AudioCaps in Fréchet Audio Distance (FAD) and Kullback-Leibler (KL) divergence, respectively. Additionally, we propose a novel masked next-token prediction task that incorporates masked prediction into the causal LM framework. On AudioCaps, this innovation yields 41% and 33% relative FAD improvements over the AudioGen Base (285M) and AudioGen Large (1B) models, respectively, and is on par with state-of-the-art (SOTA) diffusion models. Furthermore, we achieve these results with significantly fewer parameters -- 193M for our Base and 462M for our Large models.
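The core idea of token-wise diffusion — sampling the next continuous-valued token by iterative denoising conditioned on the causal LM's hidden state — can be illustrated with a minimal NumPy sketch. This is not the paper's implementation: the noise schedule, the linear `toy_denoiser`, and the token dimension are toy stand-ins for the learned components.

```python
import numpy as np

rng = np.random.default_rng(0)

def diffusion_sample_token(hidden, denoiser, n_steps=10, dim=8):
    """Sample one continuous-valued token by iterative denoising,
    conditioned on the causal LM hidden state (toy sketch)."""
    x = rng.standard_normal(dim)            # start from Gaussian noise
    for t in range(n_steps, 0, -1):
        eps_hat = denoiser(x, hidden, t)    # predicted noise at step t
        alpha = 1.0 - t / (n_steps + 1)     # toy schedule, not the paper's
        x = (x - (1.0 - alpha) * eps_hat) / np.sqrt(alpha)
    return x

# Hypothetical linear denoiser standing in for the learned network:
# in the paper this role is played by a trained diffusion head.
W = rng.standard_normal((8, 8)) * 0.1
def toy_denoiser(x, hidden, t):
    return W @ x + 0.01 * hidden

hidden = rng.standard_normal(8)             # causal LM output for one position
token = diffusion_sample_token(hidden, toy_denoiser)  # shape (8,)
```

At inference time, each sampled token would be fed back into the decoder to condition the next prediction, exactly as discrete tokens are in a standard autoregressive LM.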
Problem

Research questions and friction points this paper is trying to address.

Modeling continuous audio tokens without discretization
Improving audio generation over discrete token methods
Combining masked prediction with causal language models
Innovation

Methods, ideas, or system contributions that make the work stand out.

Continuous-valued tokens replace discrete tokens
Token-wise diffusion models next-token distribution
Masked next-token prediction enhances causal LM
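The masked next-token prediction idea — corrupting some input tokens while still training the model to predict the next token at every position, so the left-to-right factorization is preserved — can be sketched as a batch-construction step. The `mask_vec` placeholder and the masking ratio are assumptions for illustration, not details from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def masked_next_token_batch(tokens, mask_ratio=0.3, mask_vec=None):
    """Build (inputs, targets) for masked next-token prediction (sketch).

    A fraction of input tokens is replaced by a mask vector (standing in
    for a learned [MASK] embedding), but targets remain the shifted next
    tokens, so the causal prediction objective is unchanged.
    """
    T, D = tokens.shape
    if mask_vec is None:
        mask_vec = np.zeros(D)
    inputs = tokens[:-1].copy()             # positions 0 .. T-2
    targets = tokens[1:].copy()             # next token at each position
    n_mask = int(mask_ratio * (T - 1))
    idx = rng.choice(T - 1, size=n_mask, replace=False)
    inputs[idx] = mask_vec                  # corrupt inputs only
    return inputs, targets, idx

tokens = rng.standard_normal((10, 4))       # 10 continuous tokens, dim 4
inp, tgt, masked = masked_next_token_batch(tokens)
```

Because only the inputs are corrupted, the model must reconstruct context around the masked positions while still decoding causally, which is the contextual-modeling benefit the paper attributes to this objective.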