Scaling Open Discrete Audio Foundation Models with Interleaved Semantic, Acoustic, and Text Tokens

📅 2026-02-18
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work proposes SODA (Scaling Open Discrete Audio), a native audio foundation model family that addresses the limitations of text-centric audio language models in modeling general audio content. SODA jointly learns semantic, acoustic, and linguistic representations through autoregressive next-token prediction on a unified sequence of 500 billion discrete audio and text tokens. The study presents the first systematic investigation of scaling laws for discrete audio models, using IsoFLOP analysis to identify the optimal trade-off between data volume and model size. A single unified architecture supports diverse multimodal audio tasks. Evaluated across a range of benchmarks, including voice-preserving speech-to-speech translation, the resulting models, spanning 135M to 4B parameters, demonstrate strong generalization.
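The interleaving described above can be sketched as follows: a common way to put text, semantic audio, and acoustic (codec) tokens into one autoregressive stream is to map each modality into a disjoint id range via per-modality vocabulary offsets. The vocabulary sizes and layout below are illustrative assumptions, not SODA's actual configuration.

```python
# Hypothetical vocabulary sizes -- not taken from the paper.
TEXT_VOCAB = 32_000      # text (BPE) tokens occupy ids [0, 32000)
SEMANTIC_VOCAB = 1_024   # semantic audio tokens
ACOUSTIC_VOCAB = 4_096   # acoustic codec tokens

# Offsets place each modality in a disjoint slice of one shared vocabulary.
SEM_OFFSET = TEXT_VOCAB
ACO_OFFSET = TEXT_VOCAB + SEMANTIC_VOCAB
UNIFIED_VOCAB = TEXT_VOCAB + SEMANTIC_VOCAB + ACOUSTIC_VOCAB

def interleave(text_ids, semantic_ids, acoustic_ids):
    """Shift each modality into its id range and concatenate into one
    sequence that a single next-token-prediction model can consume."""
    seq = list(text_ids)
    seq += [t + SEM_OFFSET for t in semantic_ids]
    seq += [t + ACO_OFFSET for t in acoustic_ids]
    return seq
```

Because the ranges never overlap, the model can emit any modality at any position, and a decoder can recover the modality of each token purely from its id.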

📝 Abstract
Current audio language models are predominantly text-first, either extending pre-trained text LLM backbones or relying on semantic-only audio tokens, limiting general audio modeling. This paper presents a systematic empirical study of native audio foundation models that apply next-token prediction to audio at scale, jointly modeling semantic content, acoustic details, and text to support both general audio generation and cross-modal capabilities. We provide comprehensive empirical insights for building such models: (1) We systematically investigate design choices -- data sources, text mixture ratios, and token composition -- establishing a validated training recipe. (2) We conduct the first scaling law study for discrete audio models via IsoFLOP analysis on 64 models spanning $3{\times}10^{18}$ to $3{\times}10^{20}$ FLOPs, finding that optimal data grows 1.6$\times$ faster than optimal model size. (3) We apply these lessons to train SODA (Scaling Open Discrete Audio), a suite of models from 135M to 4B parameters on 500B tokens, comparing against our scaling predictions and existing models. SODA serves as a flexible backbone for diverse audio/text tasks -- we demonstrate this by fine-tuning for voice-preserving speech-to-speech translation, using the same unified architecture.
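The abstract's scaling finding can be made concrete with a small sketch. Under the standard approximation that training compute is C ≈ 6·N·D (N parameters, D tokens), writing N_opt ∝ C^a and D_opt ∝ C^b implies a + b = 1; the reported "data grows 1.6× faster than model size" then pins b ≈ 1.6·a. The proportionality constants below are placeholders for illustration, not values fitted in the paper.

```python
# Sketch of compute-optimal allocation under the paper's reported ratio.
# With C ~ 6*N*D, N_opt ~ C^a, D_opt ~ C^b, a + b = 1 and b = 1.6 * a,
# so a = 1/2.6 and b = 1.6/2.6. k_n and k_d are hypothetical constants.
RATIO = 1.6
A = 1.0 / (1.0 + RATIO)    # exponent for optimal model size
B = RATIO / (1.0 + RATIO)  # exponent for optimal training tokens

def compute_optimal_split(flops, k_n=1.0, k_d=1.0):
    """Return (optimal model size, optimal token count) for a FLOP budget,
    up to the unknown multiplicative constants k_n and k_d."""
    n_opt = k_n * flops ** A
    d_opt = k_d * flops ** B
    return n_opt, d_opt
```

The practical reading: each time the compute budget grows, proportionally more of it should go into training tokens than into parameters, which is consistent with training the 135M-4B SODA models on a comparatively large 500B-token corpus.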
Problem

Research questions and friction points this paper is trying to address.

audio foundation models
discrete audio tokens
text-first models
general audio modeling
cross-modal capabilities
Innovation

Methods, ideas, or system contributions that make the work stand out.

discrete audio modeling
scaling laws
interleaved tokens
audio foundation models
next-token prediction