A Survey of Large Audio Language Models: Generalization, Trustworthiness, and Outlook

2026-05-18
Citations: 0
Influential: 0

AI Summary
Large audio language models (LALMs) are advancing rapidly, but frameworks for ensuring their trustworthiness have not kept pace, and the models face emerging security and privacy risks, including cross-modal jailbreaking, latent acoustic backdoors, and biometric privacy leakage. This survey establishes a trustworthiness taxonomy for LALMs and reviews the state of the art along six dimensions: hallucination, robustness, safety, privacy, fairness, and authentication. It examines how the shift to unified end-to-end frameworks and the integration of continuous acoustic signals expand the attack surface, and it covers the architectural innovations and alignment algorithms behind these models. The analysis finds a pronounced imbalance between a mature offensive landscape and underdeveloped defenses. To close this gap, the authors propose a roadmap built on "Defense-in-Depth" architectures, causal auditory world modeling, and intrinsic representation engineering, with the goal of intrinsically trustworthy audio intelligence.
๐Ÿ“ Abstract
The foundational capabilities established by Large Language Models (LLMs) have paved the way for Multimodal Large Language Models (MLLMs), within which Large Audio Language Models (LALMs) are essential for realizing universal auditory intelligence. Despite their remarkable performance, the escalation of LALMs' capabilities has significantly outpaced the development of systemic frameworks to ensure their trustworthiness. This survey provides a comprehensive investigation into the endogenous mechanisms of LALMs, detailing the architectural innovations and alignment algorithms that facilitate emergent reasoning. Specifically, we analyze how the transition to unified end-to-end frameworks and the integration of continuous acoustic signals inherently expand the attack surface. To rigorously evaluate the risks within these paradigms, we establish a comprehensive taxonomy of trustworthiness, categorizing critical vulnerabilities such as cross-modal jailbreaking, latent acoustic backdoors, and biometric privacy leakage. We review the state-of-the-art through six analytical pillars: hallucination, robustness, safety, privacy, fairness, and authentication. The profound imbalance between a mature offensive landscape and underdeveloped defenses further validates the critical trustworthiness gaps and multidimensional risks facing audio-centric intelligence. Finally, we propose a strategic roadmap advocating for "Defense-in-Depth" architectures, causal auditory world modeling, and intrinsic representation engineering to bridge the gap between empirical performance and intrinsically trustworthy audio intelligence. The project is available on GitHub: https://github.com/Kwwwww74/Awesome-Trustworthy-AudioLLMs.
Problem

Research questions and friction points this paper is trying to address.

Large Audio Language Models
Trustworthiness
Privacy Leakage
Cross-modal Jailbreaking
Acoustic Backdoors
Innovation

Methods, ideas, or system contributions that make the work stand out.

Trustworthiness
Large Audio Language Models
Defense-in-Depth
Acoustic Backdoors
Multimodal Alignment
Kaiwen Luo
Nanyang Technological University
Zhenhong Zhou
Nanyang Technological University
Large Language Model, AI Safety, LLM Safety
Leo Wang
Independent Researcher
Liang Lin
Fellow of IEEE/IAPR, Professor of Computer Science, Sun Yat-sen University
Embodied AI, Causal Inference and Learning, Multimodal Data Analysis
Yang Xiao
The University of Melbourne
Speech Signal Processing, Audio Processing, Data Centric AI
Tianyu Shao
North China Electric Power University
Yuanhe Zhang
PhD in Statistics, Department of Statistics, University of Warwick
Learning Theory, Reasoning, Statistics
Yuxuan Li
University of Chinese Academy of Sciences
Miao Yu
Undergraduate, University of Science and Technology of China
Trustworthy AI, Autonomous Agent, AI4Science
Kailin Lyu
Institute of Automation, Chinese Academy of Sciences
Jiaming Zhang
Nanyang Technological University
Trustworthy AI, Multimodal, Computer Vision, Multimedia
Dongrui Liu
Shanghai AI Laboratory
Li Sun
Beijing University of Posts and Telecommunications
Yueming Wu
Huazhong University of Science and Technology
Software Security
Kai Li
Tsinghua University
Ting Dang
Senior Lecturer in AI for Health, The University of Melbourne
Mobile Health, Audio Processing, Affective Computing, Time Series Modelling, Wearable Sensing
Xiaojun Jia
Nanyang Technological University
Explainable AI, Robust AI, Efficient AI
Rohan Kumar Das
Fortemedia Singapore
Speech Processing, Speaker Verification, Anti-spoofing, Deep Learning, Human-Computer Interaction
Xinfeng Li
Nanyang Technological University
Siyuan Liang
College of Computing and Data Science, Nanyang Technological University
Trustworthy Foundation Model
Qiufeng Wang
Tencent
Xingjun Ma
Fudan University
Trustworthy AI, Multimodal AI, Generative AI, Embodied AI
Jing Chen
Professor, Wuhan University
Network Security, Cloud Security, Mobile Security
Kun Wang
Singapore University of Technology and Design
Deep Learning, Computer Vision
Junhao Dong
Nanyang Technological University
AI Safety, Robust AI