AudioTrust: Benchmarking the Multifaceted Trustworthiness of Audio Large Language Models

📅 2025-05-22
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current audio large language models (ALLMs) lack systematic, modality-specific trustworthiness evaluation frameworks addressing their unique risks. Method: We introduce the first multidimensional benchmark covering fairness, hallucination, safety, privacy, robustness, and speaker authentication—built upon 4,420+ real-world audio-text samples and 18 experimental configurations. Our methodology includes multimodal audio-text data curation, design of nine audio-specific evaluation metrics, a scalable automated scoring pipeline, and real-scenario-driven adversarial testing. Contribution/Results: We formally define and quantify audio-specific trustworthiness risks for the first time, open-source an extensible, automated ALLM trustworthiness evaluation platform, and empirically demonstrate pervasive hallucinations, privacy leakage, and speaker misidentification in mainstream ALLMs under high-risk audio conditions—providing empirical foundations and technical support for trustworthy deployment.

📝 Abstract
The rapid advancement and expanding applications of Audio Large Language Models (ALLMs) demand a rigorous understanding of their trustworthiness. However, systematic research on evaluating these models, particularly concerning risks unique to the audio modality, remains largely unexplored. Existing evaluation frameworks primarily focus on the text modality or address only a restricted set of safety dimensions, failing to adequately account for the unique characteristics and application scenarios inherent to the audio modality. We introduce AudioTrust, the first multifaceted trustworthiness evaluation framework and benchmark specifically designed for ALLMs. AudioTrust facilitates assessments across six key dimensions: fairness, hallucination, safety, privacy, robustness, and authentication. To comprehensively evaluate these dimensions, AudioTrust is structured around 18 distinct experimental setups. Its core is a meticulously constructed dataset of over 4,420 audio/text samples, drawn from real-world scenarios (e.g., daily conversations, emergency calls, voice assistant interactions), specifically designed to probe the multifaceted trustworthiness of ALLMs. For assessment, the benchmark defines nine audio-specific evaluation metrics, and we employ a large-scale automated pipeline for objective and scalable scoring of model outputs. Experimental results reveal the trustworthiness boundaries and limitations of current state-of-the-art open-source and closed-source ALLMs when confronted with various high-risk audio scenarios, offering valuable insights for the secure and trustworthy deployment of future audio models. Our platform and benchmark are available at https://github.com/JusperLee/AudioTrust.
Problem

Research questions and friction points this paper is trying to address.

Evaluating trustworthiness of Audio Large Language Models (ALLMs) comprehensively
Addressing unique audio modality risks ignored by existing frameworks
Assessing ALLMs across fairness, safety, privacy, and robustness
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multifaceted trustworthiness evaluation framework for ALLMs
Dataset of over 4,420 real-world audio/text samples
Nine audio-specific evaluation metrics for scoring
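The abstract describes a large-scale automated pipeline that scores ALLM outputs against audio-specific metrics and aggregates results per trustworthiness dimension. Below is a minimal, purely illustrative sketch of such a scoring loop; the `Sample` structure, the keyword-based `judge` rubric, and `score_benchmark` are hypothetical stand-ins, not AudioTrust's actual API (which lives in the linked repository).

```python
# Hypothetical sketch of an automated trustworthiness-scoring loop.
# All names and the toy rubric below are illustrative assumptions,
# not AudioTrust's real implementation.
from dataclasses import dataclass


@dataclass
class Sample:
    dimension: str   # e.g. "privacy", "hallucination", "safety"
    prompt: str      # the audio/text probe given to the ALLM
    response: str    # the ALLM output to be scored


def judge(sample: Sample) -> float:
    """Stand-in for an automated judge (in practice, e.g. an LLM grader).

    Toy rubric so the sketch runs: a privacy probe scores 0.0 if the
    response appears to leak digits (e.g. a phone number), else 1.0.
    """
    if sample.dimension == "privacy" and any(ch.isdigit() for ch in sample.response):
        return 0.0
    return 1.0


def score_benchmark(samples: list[Sample]) -> dict[str, float]:
    """Average judge scores per trustworthiness dimension."""
    per_dim: dict[str, list[float]] = {}
    for s in samples:
        per_dim.setdefault(s.dimension, []).append(judge(s))
    return {dim: sum(scores) / len(scores) for dim, scores in per_dim.items()}


samples = [
    Sample("privacy", "What is the caller's phone number?", "It is 555-0123."),
    Sample("privacy", "What is the caller's phone number?", "I can't share that."),
]
print(score_benchmark(samples))  # {'privacy': 0.5}
```

The design point is simply that once the judge is automated, scoring scales linearly with the sample count and stays objective across models.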
👥 Authors

Kai Li
Tsinghua University
Can Shen
Nanyang Technological University
Yile Liu
Waseda University
Jirui Han
Independent Researcher
Kelong Zheng
HUST
Xuechao Zou
BJTU
Zhe Wang
Hong Kong Polytechnic University
Xingjian Du
University of Rochester
Shun Zhang
QHU
Hanjun Luo
New York University Abu Dhabi
Trustworthy AI, Large Language Models, Text-to-Image
Yingbin Jin
Zhejiang University
Xinxin Xing
Independent Researcher
Ziyang Ma
Shanghai Jiao Tong University
Yue Liu
National University of Singapore
Xiaojun Jia
Nanyang Technological University
Explainable AI, Robust AI, Efficient AI
Yifan Zhang
CAS
Junfeng Fang
National University of Singapore
Model Editing, AI Safety, LLM Explainability, AI4Science
Kun Wang
Nanyang Technological University
Yibo Yan
East China Normal University
High-dimensional Statistics
Haoyang Li
Hong Kong Polytechnic University
Yiming Li
Nanyang Technological University
Xiaobin Zhuang
Bytedance
Audio Generation
Yang Liu
Nanyang Technological University
Haibo Hu
Hong Kong Polytechnic University
Zhuo Chen
Bytedance
Zhizheng Wu
The Chinese University of Hong Kong, Shenzhen (CUHK-Shenzhen), Mel Lab
Spoken Language Processing, DeepFake Detection, Music Processing
Xiaolin Hu
Nanyang Technological University
Eng-Siong Chng
Nanyang Technological University
Speech and Language Processing, Digital Signal Processing, Pattern Recognition
XiaoFeng Wang
Chair, ACM SIGSAC
AI-Centered Security, Systems Security and Privacy, Healthcare Privacy, Incentive Engineering
Wenyuan Xu
Professor, IEEE Fellow, Zhejiang University, College of EE
Wireless Network Security, Embedded System Security, Analog Cyber Security, IoT Security
Wei Dong
Nanyang Technological University
Xinfeng Li
Nanyang Technological University