DIFFA-2: A Practical Diffusion Large Language Model for General Audio Understanding

📅 2026-01-30
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work proposes a diffusion-based large audio language model to address the high data and computational costs as well as low inference efficiency associated with scaling autoregressive architectures. For the first time, it demonstrates that diffusion architectures can serve as effective backbone networks at practical scales. The model is trained entirely on open-source audio corpora through a four-stage curriculum incorporating semantic–acoustic dual adapters, large-scale supervised fine-tuning, and variance-reduced preference optimization. Evaluated on the MMSU, MMAU, and MMAR benchmarks, the proposed model significantly outperforms its predecessor DIFFA and matches the performance of leading autoregressive models, thereby establishing the feasibility and competitiveness of the diffusion paradigm for general-purpose audio understanding.

📝 Abstract
Autoregressive (AR) large audio language models (LALMs) such as Qwen-2.5-Omni have achieved strong performance on audio understanding and interaction, but scaling them remains costly in data and computation, and strictly sequential decoding limits inference efficiency. Diffusion large language models (dLLMs) have recently been shown to make effective use of limited training data, and prior work on DIFFA indicates that replacing an AR backbone with a diffusion counterpart can substantially improve audio understanding under matched settings, albeit at a proof-of-concept scale without large-scale instruction tuning, preference alignment, or practical decoding schemes. We introduce DIFFA-2, a practical diffusion-based LALM for general audio understanding. DIFFA-2 upgrades the speech encoder, employs dual semantic and acoustic adapters, and is trained with a four-stage curriculum that combines semantic and acoustic alignment, large-scale supervised fine-tuning, and variance-reduced preference optimization, using only fully open-source corpora. Experiments on MMSU, MMAU, and MMAR show that DIFFA-2 consistently improves over DIFFA and is competitive with strong AR LALMs under practical training budgets, supporting diffusion-based modeling as a viable backbone for large-scale audio understanding. Our code is available at https://github.com/NKU-HLT/DIFFA.git.
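The abstract's efficiency argument is that AR decoding is strictly sequential (one token per step), while a diffusion decoder can denoise many positions in parallel. The step-count contrast can be sketched as follows (a toy illustration only, not the paper's actual decoding scheme; the function names and the tokens-per-step parameter are hypothetical):

```python
import math

def ar_decode_steps(seq_len: int) -> int:
    """Autoregressive decoding emits exactly one token per forward pass."""
    return seq_len

def diffusion_decode_steps(seq_len: int, tokens_per_step: int) -> int:
    """A masked-diffusion decoder can unmask several positions per
    denoising step, so the step count shrinks with the parallelism."""
    return math.ceil(seq_len / tokens_per_step)

# For a 128-token response, unmasking 8 positions per denoising step:
print(ar_decode_steps(128))            # 128 steps
print(diffusion_decode_steps(128, 8))  # 16 steps
```

In practice the achievable speedup depends on how aggressively positions can be unmasked without degrading quality, which is why the paper emphasizes practical decoding schemes rather than raw parallelism.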
Problem

Research questions and friction points this paper is trying to address.

diffusion large language model
audio understanding
autoregressive model
inference efficiency
large-scale training
Innovation

Methods, ideas, or system contributions that make the work stand out.

diffusion large language model
audio understanding
dual adapters
curriculum training
open-source training