UALM: Unified Audio Language Model for Understanding, Generation and Reasoning

📅 2025-10-13

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Problem: Existing audio language modeling (ALM) research treats audio understanding and text-to-audio generation as separate tasks, and very few studies attempt to unify them. Method: The paper proposes UALM, a single model that covers audio understanding, text-to-audio generation, and multimodal reasoning. It builds on UALM-Gen, a text-to-audio language model that directly predicts audio tokens, and relies on data blending, training recipes, and inference techniques to combine the tasks. UALM-Reason extends this with intermediate thinking steps that use both text and audio. Contribution/Results: UALM-Gen is comparable to state-of-the-art diffusion-based models, and the single UALM model matches the quality of state-of-the-art specialized models in audio understanding, text-to-audio generation, and text reasoning. Subjective evaluations confirm the effectiveness of cross-modal generative reasoning, which the authors describe as, to their knowledge, the first such demonstration in audio research.

📝 Abstract

Recent advances in the audio language modeling (ALM) domain tackle audio understanding and text-to-audio generation as separate tasks. Very few studies attempt to unify these tasks, an essential step toward advanced multimodal reasoning. This paper introduces the Unified Audio Language Model (UALM), which aims to unify audio understanding, text-to-audio generation, and multimodal reasoning in a single model. To achieve this goal, we first present UALM-Gen, a text-to-audio language model that directly predicts audio tokens and is comparable to state-of-the-art diffusion-based models. We then demonstrate, using proper data blending, training recipes, and inference techniques, that our single UALM model matches the quality of state-of-the-art specialized models in audio understanding, text-to-audio generation, and text reasoning. Furthermore, we present UALM-Reason, a multimodal reasoning model that utilizes both text and audio in the intermediate thinking steps to facilitate complex generation tasks. To our knowledge, this is the first demonstration in audio research of cross-modal generative reasoning, with its effectiveness confirmed by subjective evaluations.

Problem

Research questions and friction points this paper is trying to address.

Unifying audio understanding and text-to-audio generation tasks

Developing a single model for multimodal audio-text reasoning

Enabling cross-modal reasoning for complex audio generation

Innovation

Methods, ideas, or system contributions that make the work stand out.

Unified model for audio understanding and generation

Direct audio token prediction for text-to-audio generation

Cross-modal reasoning using both text and audio
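
One way to picture direct audio token prediction in a model that also handles text is a single shared token ID space, where text tokens and discrete audio codec tokens sit side by side and one sequence can switch between them. The toy sketch below only illustrates that bookkeeping. It is not the paper's implementation: the vocabulary sizes, the offset scheme, and the helper names (`audio_to_unified`, `is_audio`, `split_segments`) are all assumptions made for illustration, and no tokenizer or model is involved.

```python
from typing import List, Tuple

# All sizes below are illustrative assumptions, not values from the paper.
TEXT_VOCAB_SIZE = 1000     # hypothetical number of text tokens
AUDIO_CODEBOOK_SIZE = 256  # hypothetical number of audio codec tokens


def audio_to_unified(code: int) -> int:
    """Map an audio codec index into the shared ID space, after the text IDs."""
    if not 0 <= code < AUDIO_CODEBOOK_SIZE:
        raise ValueError(f"audio code out of range: {code}")
    return TEXT_VOCAB_SIZE + code


def is_audio(token_id: int) -> bool:
    """True when a shared-vocabulary ID refers to an audio codec token."""
    return TEXT_VOCAB_SIZE <= token_id < TEXT_VOCAB_SIZE + AUDIO_CODEBOOK_SIZE


def split_segments(sequence: List[int]) -> List[Tuple[str, List[int]]]:
    """Group a mixed sequence into consecutive ("text" or "audio", ids) runs.

    Audio IDs are converted back to raw codec indices.
    """
    segments: List[Tuple[str, List[int]]] = []
    for token_id in sequence:
        kind = "audio" if is_audio(token_id) else "text"
        value = token_id - TEXT_VOCAB_SIZE if kind == "audio" else token_id
        if segments and segments[-1][0] == kind:
            segments[-1][1].append(value)
        else:
            segments.append((kind, [value]))
    return segments


# A made-up sequence: a text prompt, three audio tokens, then more text.
prompt = [12, 47, 3]
audio = [audio_to_unified(c) for c in [5, 200, 17]]
mixed = prompt + audio + [9]
print(split_segments(mixed))
```

Under these assumptions, a sequence that interleaves text and audio, as in reasoning that uses both modalities in its intermediate steps, can be split back into per-modality runs after generation, with the audio runs handed to a codec decoder.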
