DIFFA: Large Language Diffusion Models Can Listen and Understand

📅 2025-07-24
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current audio-language models predominantly adopt autoregressive paradigms and struggle to balance generation quality with inference efficiency; diffusion models remain largely unexplored for speech understanding. This paper introduces DIFFA—the first diffusion-based large-scale audio-language model tailored for spoken language understanding. Methodologically, DIFFA (1) freezes a pretrained diffusion language model and employs lightweight dual adapters for audio–text cross-modal alignment; (2) proposes a two-stage training paradigm leveraging both ASR alignment supervision and synthetic audio-caption instruction data generated by large language models; and (3) supports bidirectional context modeling and controllable speech-semantic generation. Trained on only 960 hours of real ASR data and 127 hours of synthetic data, DIFFA outperforms several open-source autoregressive baselines on the MMSU, MMAU, and VoiceBench benchmarks. These results demonstrate the efficacy and scalability of diffusion mechanisms for efficient, high-performance spoken language understanding.

📝 Abstract
Recent advances in large language models (LLMs) have shown remarkable capabilities across textual and multimodal domains. In parallel, diffusion-based language models have emerged as a promising alternative to the autoregressive paradigm, offering improved controllability, bidirectional context modeling, and robust generation. However, their application to the audio modality remains underexplored. In this work, we introduce DIFFA, the first diffusion-based Large Audio-Language Model designed to perform spoken language understanding. DIFFA integrates a frozen diffusion language model with a lightweight dual-adapter architecture that bridges speech understanding and natural language reasoning. We employ a two-stage training pipeline: first, aligning semantic representations via an ASR objective; then, learning instruction-following abilities through synthetic audio-caption pairs automatically generated by prompting LLMs. Despite being trained on only 960 hours of ASR and 127 hours of synthetic instruction data, DIFFA demonstrates competitive performance on major benchmarks, including MMSU, MMAU, and VoiceBench, outperforming several autoregressive open-source baselines. Our results reveal the potential of diffusion-based language models for efficient and scalable audio understanding, opening a new direction for speech-driven AI. Our code will be available at https://github.com/NKU-HLT/DIFFA.git.
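The abstract sketches the overall data flow: a speech encoder feeds lightweight dual adapters, whose output is placed alongside text embeddings as input to the frozen diffusion language model. The toy NumPy sketch below illustrates only this wiring; the adapter form (two linear projections whose outputs are summed), all dimensions, and all names are assumptions for illustration, not the paper's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions; the real feature and LLM hidden sizes are not given here.
D_AUDIO, D_LLM, T_AUDIO, T_TEXT = 32, 64, 10, 5

# Stand-in for per-frame features from a frozen speech encoder.
audio_feats = rng.standard_normal((T_AUDIO, D_AUDIO))

# Hypothetical dual adapters: two small linear maps, e.g. one aligned during
# the ASR stage and one tuned during instruction tuning. Only these would
# be trained; the diffusion LM itself stays frozen.
W_semantic = rng.standard_normal((D_AUDIO, D_LLM)) * 0.02
W_instruct = rng.standard_normal((D_AUDIO, D_LLM)) * 0.02

def project_audio(feats: np.ndarray) -> np.ndarray:
    """Map audio features into the LLM embedding space via both adapters."""
    return feats @ W_semantic + feats @ W_instruct

audio_tokens = project_audio(audio_feats)

# Stand-in for text prompt embeddings from the frozen LM's embedding table.
text_tokens = rng.standard_normal((T_TEXT, D_LLM))

# The frozen diffusion LM would consume the concatenated sequence.
llm_input = np.concatenate([audio_tokens, text_tokens], axis=0)
print(llm_input.shape)  # (15, 64)
```

The point of the sketch is the shape bookkeeping: audio frames become pseudo-tokens in the LLM's embedding dimension, so the frozen model can attend over speech and text jointly without any change to its weights.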
Problem

Research questions and friction points this paper is trying to address.

Exploring diffusion models for audio understanding tasks
Bridging speech and language with a lightweight dual adapter
Enhancing audio-language models with synthetic instruction data
Innovation

Methods, ideas, or system contributions that make the work stand out.

Diffusion-based Large Audio-Language Model
Dual-adapter bridges speech and language
Two-stage training with ASR and LLMs
Jiaming Zhou
College of Computer Science, Nankai University
Hongjie Chen
Institute of Artificial Intelligence (TeleAI), China Telecom, China
Shiwan Zhao
Independent Researcher, Research Scientist of IBM Research - China (2000-2020)
AGI · Large Language Model · NLP · Speech · Recommender System
Jian Kang
Institute of Artificial Intelligence (TeleAI), China Telecom, China
Jie Li
Institute of Artificial Intelligence (TeleAI), China Telecom, China
Enzhi Wang
Nankai University
Machine learning · Data mining · Natural language processing
Yujie Guo
yujie.guo@ugent.be
low dimensional semiconductors
Haoqin Sun
Nankai University
Affective computing · Speech signal processing · Audio understanding
Hui Wang
College of Computer Science, Nankai University
Aobo Kong
Nankai University
NLP · LLM
Yong Qin
College of Computer Science, Nankai University
Xuelong Li
Institute of Artificial Intelligence (TeleAI), China Telecom, China