DeSTA2.5-Audio: Toward General-Purpose Large Audio Language Model with Self-Generated Cross-Modal Alignment

📅 2025-07-03
📈 Citations: 0
Influential: 0
🤖 AI Summary
Large audio-language models (LALMs) face a fundamental trade-off between enhanced auditory perception and catastrophic forgetting of linguistic capabilities. Method: This paper proposes DeSTA, a self-generated cross-modal alignment framework in which the backbone large language model autonomously synthesizes high-quality audio-text alignment targets, without task-specific instruction tuning. Leveraging 7,000 hours of diverse, multi-source audio, we construct DeSTA-AQA5M, a 5-million-sample dataset enabling robust zero-shot audio understanding and language generation. Contribution/Results: DeSTA2.5-Audio achieves state-of-the-art or competitive performance on major benchmarks, including Dynamic-SUPERB, MMAU, and SAKURA, significantly outperforming conventional supervised data-construction and training paradigms. To our knowledge, this is the first work to empirically validate both the effectiveness and scalability of LLM-generated alignment data for audio-language modeling, establishing a new paradigm for self-supervised multimodal pretraining.

📝 Abstract
We introduce DeSTA2.5-Audio, a general-purpose Large Audio Language Model (LALM) designed for robust auditory perception and instruction-following, without requiring task-specific audio instruction-tuning. Recent LALMs typically augment Large Language Models (LLMs) with auditory capabilities by training on large-scale, manually curated or LLM-synthesized audio-instruction datasets. However, these approaches have often suffered from the catastrophic forgetting of the LLM's original language abilities. To address this, we revisit the data construction pipeline and propose DeSTA, a self-generated cross-modal alignment strategy in which the backbone LLM generates its own training targets. This approach preserves the LLM's native language proficiency while establishing effective audio-text alignment, thereby enabling zero-shot generalization without task-specific tuning. Using DeSTA, we construct DeSTA-AQA5M, a large-scale, task-agnostic dataset containing 5 million training samples derived from 7,000 hours of audio spanning 50 diverse datasets, including speech, environmental sounds, and music. DeSTA2.5-Audio achieves state-of-the-art or competitive performance across a wide range of audio-language benchmarks, including Dynamic-SUPERB, MMAU, SAKURA, Speech-IFEval, and VoiceBench. Comprehensive comparative studies demonstrate that our self-generated strategy outperforms widely adopted data construction and training strategies in both auditory perception and instruction-following capabilities. Our findings underscore the importance of carefully designed data construction in LALM development and offer practical insights for building robust, general-purpose LALMs.
Problem

Research questions and friction points this paper is trying to address.

Develop a general-purpose audio-language model without task-specific instruction tuning
Prevent catastrophic forgetting of the LLM's original language abilities
Achieve robust audio-text alignment for zero-shot generalization
Innovation

Methods, ideas, or system contributions that make the work stand out.

Self-generated cross-modal alignment strategy (DeSTA), in which the backbone LLM generates its own training targets
Task-agnostic dataset (DeSTA-AQA5M) of 5 million samples from 7,000 hours of audio
Preserves the LLM's native language proficiency
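The core idea above is that the backbone LLM writes its own training targets: each audio clip's metadata is rendered as a text description, the frozen LLM answers an instruction about it, and that answer becomes the target paired with the audio, keeping targets in the LLM's native output distribution. Below is a minimal illustrative sketch of such a data-construction loop, not the paper's actual pipeline; the helper names (`describe_audio`, `backbone_llm`, `make_training_sample`) are hypothetical, and the LLM is stubbed with a canned reply.

```python
# Hypothetical sketch of a self-generated alignment data pipeline in the
# spirit of DeSTA. All function names are illustrative stand-ins.

def describe_audio(metadata: dict) -> str:
    """Render an audio clip's metadata (transcript plus acoustic/speaker
    tags) as a plain-text description the backbone LLM can read."""
    parts = [f"Transcript: {metadata['transcript']}"]
    for key in ("emotion", "gender", "background"):
        if key in metadata:
            parts.append(f"{key.capitalize()}: {metadata[key]}")
    return "\n".join(parts)

def backbone_llm(prompt: str) -> str:
    """Stand-in for the frozen backbone LLM. A real pipeline would query
    the actual model; here we return a canned response for illustration."""
    transcript = prompt.splitlines()[0].removeprefix("Transcript: ")
    return "The speaker cheerfully says: " + transcript

def make_training_sample(metadata: dict, instruction: str) -> dict:
    """Pair the audio with a target the backbone LLM generated itself,
    so the target text stays in-distribution for that LLM."""
    description = describe_audio(metadata)
    target = backbone_llm(f"{description}\n\nInstruction: {instruction}")
    return {"audio": metadata["path"], "instruction": instruction, "target": target}

sample = make_training_sample(
    {"path": "clip_0001.wav", "transcript": "good morning", "emotion": "happy"},
    "Describe what you hear.",
)
```

Because the same LLM that will later be trained for audio produces the targets, the alignment data cannot drift away from its language prior, which is the mechanism the paper credits for avoiding catastrophic forgetting.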
Ke-Han Lu (National Taiwan University): Natural Language Processing, Speech Recognition
Zhehuai Chen (NVIDIA): Speech Recognition, Speech Synthesis, LLMs
Szu-Wei Fu (NVIDIA): Machine Learning, Deep Learning, Speech Processing
Chao-Han Huck Yang (Sr. Research Scientist, NVIDIA Research): Robust Speech Recognition, Language Models, Post-Training, Sequence Modeling
Sung-Feng Huang (Research Scientist, NVIDIA): Machine Learning, Speech Processing, Natural Language Processing
Chih-Kai Yang
Chee-En Yu
Chun-Wei Chen
Wei-Chih Chen (National Taiwan University): Speech Processing, Diffusion Models
Chien-yu Huang (Carnegie Mellon University): Speech Processing, Natural Language Processing, Deep Learning
Yi-Cheng Lin (National Taiwan University): Speech Processing, Machine Learning, Fairness
Yu-Xiang Lin (National Taiwan University): NLP, Speech
Chi-An Fu
Chun-Yi Kuan (National Taiwan University): Speech Processing, Deep Learning, Spoken Language Understanding
Wenze Ren (National Taiwan University; PhD @ Sinica Bio-ASP & NTU SPML Lab): Audio-Visual
Xuanjun Chen (National Taiwan University): Speech Processing, Machine Learning, Generative AI, Deepfakes
Wei-Ping Huang (National Taiwan University): Speech Processing, Continual Learning, Self-Supervised Learning, Machine Learning
En-Pei Hu (National Taiwan University): NLP, Robotics, RL
Tzu-Quan Lin (National Taiwan University): Self-Supervised Learning, Spoken Language Models, Model Compression, Interpretability
Yuan-Kuei Wu (National Taiwan University): Deep Learning, Speech Processing
Kuan-Po Huang
Hsiao-Ying Huang
Huang-Cheng Chou (Postdoctoral Scholar, NSTC Fellow, USC Viterbi School of Engineering; formerly Amazon and Realtek): Affective Computing, Spoken Language Understanding, Emotion Recognition, Deception Detection
Kai-Wei Chang
Cheng-Han Chiang