Towards Reliable Large Audio Language Model

📅 2025-05-25
📈 Citations: 0
Influential: 0
🤖 AI Summary
Large audio-language models (LALMs) commonly lack knowledge-boundary awareness, hindering their ability to proactively abstain from answering out-of-distribution or unknown queries. This work systematically investigates two reliability-enhancement paradigms: training-free approaches (e.g., multi-modal chain-of-thought, MCoT) and training-based methods (e.g., supervised fine-tuning, SFT). We propose the Reliability Gain Index (RGI) as a novel, quantitative evaluation metric and empirically demonstrate, for the first time, that "reliability awareness" is a cross-modal meta-capability transferable across speech, music, and environmental sound domains. Experiments show that both paradigms significantly improve LALMs' proactive abstention rates on unknown questions; notably, MCoT achieves reliable abstention without requiring additional annotated data. Our study establishes a quantifiable, generalizable methodological foundation for trustworthy LALM deployment.

📝 Abstract
Recent advancements in large audio language models (LALMs) have demonstrated impressive results and promising prospects in universal understanding and reasoning across speech, music, and general sound. However, these models still lack the ability to recognize their knowledge boundaries and proactively refuse to answer questions they don't know. While there have been successful attempts to enhance the reliability of LLMs, reliable LALMs remain largely unexplored. In this paper, we systematically investigate various approaches towards reliable LALMs, including training-free methods such as multi-modal chain-of-thought (MCoT), and training-based methods such as supervised fine-tuning (SFT). In addition, we identify the limitations of previous evaluation metrics and propose a new metric, the Reliability Gain Index (RGI), to assess the effectiveness of different reliability methods. Our findings suggest that both training-free and training-based methods enhance the reliability of LALMs to different extents. Moreover, we find that awareness of reliability is a "meta ability" that can be transferred across different audio modalities, even though significant structural and content differences exist among sound, music, and speech.
Problem

Research questions and friction points this paper is trying to address.

Enhancing reliability of large audio language models (LALMs)
Recognizing knowledge boundaries in LALMs
Developing evaluation metrics for reliable LALMs
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-modal chain-of-thought (MCoT), a training-free method
Supervised fine-tuning (SFT), a training-based method
Reliability Gain Index (RGI), a new evaluation metric
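The RGI formula itself is not given on this page, so the sketch below is only a loose illustration of the kind of quantity such a metric compares: the rate at which a model proactively abstains on unknown (unanswerable) questions, before and after a reliability method such as MCoT is applied. The function name, the "ABSTAIN" marker, and the toy data are all hypothetical, not the paper's actual definition.

```python
# Hypothetical sketch: measuring proactive abstention on unknown questions.
# This is NOT the paper's actual RGI metric (its formula is not given here);
# it only illustrates the before/after comparison such a metric quantifies.

def abstention_rate(responses, is_unknown):
    """Fraction of unknown (unanswerable) questions the model abstains on.

    responses  -- list of model outputs; abstention is marked "ABSTAIN"
    is_unknown -- parallel list of booleans: True if the question is
                  out-of-distribution / unanswerable
    """
    unknown = [r for r, u in zip(responses, is_unknown) if u]
    if not unknown:
        return 0.0
    return sum(r == "ABSTAIN" for r in unknown) / len(unknown)

# Toy before/after comparison (e.g. base LALM vs. MCoT-prompted LALM).
base = ["cat", "dog", "ABSTAIN", "rain"]
mcot = ["cat", "ABSTAIN", "ABSTAIN", "rain"]
unknown_mask = [False, True, True, False]

gain = abstention_rate(mcot, unknown_mask) - abstention_rate(base, unknown_mask)
print(gain)  # → 0.5
```

A real reliability metric would also have to penalize over-abstention on answerable questions; this toy gain ignores that trade-off entirely.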
👥 Authors
Ziyang Ma
X-LANCE Lab, School of Computer Science, MoE Key Lab of Artificial Intelligence, Shanghai Jiao Tong University
Xiquan Li
Shanghai Jiao Tong University
Yakun Song
X-LANCE Lab, School of Computer Science, MoE Key Lab of Artificial Intelligence, Shanghai Jiao Tong University
Wenxi Chen
X-LANCE Lab, School of Computer Science, MoE Key Lab of Artificial Intelligence, Shanghai Jiao Tong University
Chenpeng Du
ByteDance
Jian Wu
ByteDance
Yuanzhe Chen
ByteDance
Zhuo Chen
ByteDance
Yuping Wang
ByteDance
Yuxuan Wang
ByteDance
Xie Chen
X-LANCE Lab, School of Computer Science, MoE Key Lab of Artificial Intelligence, Shanghai Jiao Tong University, Shanghai Innovation Institute