UniAudio 2.0: A Unified Audio Language Model with Text-Aligned Factorized Audio Tokenization

📅 2026-02-04
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses two limitations of existing audio language models: they lack a unified tokenizer that jointly supports comprehension and generation, and they generalize poorly in few-shot and zero-shot settings. The authors propose ReasoningCodec, a factorized audio tokenization scheme that decouples audio into reasoning tokens for semantic inference and reconstruction tokens for high-fidelity waveform synthesis. Built on this tokenizer, they present an autoregressive audio language model framework that unifies understanding, generation, and text alignment. Through multi-stage training on large-scale multi-task data comprising 100 billion text tokens and 60 billion audio tokens, the model achieves state-of-the-art performance across speech, sound, and music tasks. It matches the comprehension capabilities of continuous representation methods, surpasses existing discrete tokenizers in generation quality and reconstruction fidelity, and substantially improves few-shot and zero-shot transfer.

📝 Abstract
We study two foundational problems in audio language models: (1) how to design an audio tokenizer that can serve as an intermediate representation for both understanding and generation; and (2) how to build an audio foundation model that generalizes in few-shot and zero-shot settings, analogous to large language models. To this end, we make the following two contributions. First, we propose ReasoningCodec, a discrete audio codec that factorizes audio into (i) reasoning tokens, which encode text-aligned, high-level analysis and planning representations for audio understanding and hierarchical generation, and (ii) reconstruction tokens, which encode semantic-rich acoustic cues for high-fidelity waveform reconstruction. This design achieves understanding performance comparable to strong continuous representations while improving generation quality and reconstruction fidelity over prior discrete tokenizers. Second, we introduce a unified autoregressive architecture for text and audio, together with multi-stage training and multi-task data construction. Using this framework, we train UniAudio 2.0 on 100B text tokens and 60B audio tokens. Across a wide range of speech, sound, and music tasks, UniAudio 2.0 performs competitively on in-domain evaluations and demonstrates strong few-shot and zero-shot generalization to unseen tasks. Demo, code, and checkpoints will be available at https://dongchaoyang.top/UniAudio2Demo/.
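To make the factorization idea concrete, here is a toy NumPy sketch of the general pattern the abstract describes: a coarse, low-frame-rate token stream for high-level semantics and a fine, full-frame-rate residual stream for reconstruction. This is not the paper's actual ReasoningCodec; the function names, the mean-pooling, and the residual quantization scheme are all illustrative assumptions.

```python
import numpy as np

def nearest_code(frames, codebook):
    """Nearest-neighbor vector quantization.

    frames: (T, D) feature vectors; codebook: (K, D) code vectors.
    Returns the (T,) array of nearest codebook indices.
    """
    dists = ((frames[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    return dists.argmin(axis=1)

def factorized_tokenize(features, reasoning_codebook, recon_codebook, pool=4):
    """Toy two-stream tokenizer (illustrative, not the paper's method).

    Reasoning tokens quantize temporally pooled features (coarse semantics);
    reconstruction tokens quantize the full-rate residual (fine acoustics).
    """
    T, D = features.shape
    T_pool = T // pool
    trimmed = features[:T_pool * pool]

    # Coarse stream: one reasoning token per `pool` frames.
    pooled = trimmed.reshape(T_pool, pool, D).mean(axis=1)
    reasoning = nearest_code(pooled, reasoning_codebook)

    # Fine stream: quantize what the coarse stream failed to capture.
    upsampled = reasoning_codebook[reasoning].repeat(pool, axis=0)
    residual = trimmed - upsampled
    recon = nearest_code(residual, recon_codebook)
    return reasoning, recon
```

With 16 feature frames and `pool=4`, this yields 4 reasoning tokens and 16 reconstruction tokens, mirroring the intuition that semantic planning needs far fewer tokens per second than waveform-level detail.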
Problem

Research questions and friction points this paper is trying to address.

audio language model
audio tokenizer
few-shot generalization
zero-shot generalization
foundation model
Innovation

Methods, ideas, or system contributions that make the work stand out.

ReasoningCodec
factorized audio tokenization
unified audio language model
text-aligned audio representation
few-shot generalization