ISA-Bench: Benchmarking Instruction Sensitivity for Large Audio Language Models

📅 2025-10-27
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
Large audio language models (LALMs) are highly sensitive to instruction phrasing, yet existing benchmarks lack a systematic evaluation of this property. To address this gap, we propose ISA-Bench, the first multidimensional, dynamic benchmark explicitly designed to assess instruction sensitivity in LALMs, quantifying robustness across three dimensions: instruction description, output format, and task composition. Using a controlled-variable methodology, we construct diverse instruction variants and empirically evaluate both open-source and proprietary LALMs. Our analysis reveals significant performance degradation under instruction perturbations across mainstream LALMs and, for the first time, identifies catastrophic forgetting induced by complex-instruction fine-tuning: experiments on Qwen2-Audio show that while fine-tuning improves instruction robustness, it concurrently degrades performance on the original tasks. This work establishes a novel evaluation paradigm and actionable improvement pathways for developing instruction-robust audio-language systems.

πŸ“ Abstract
Large Audio Language Models (LALMs), which couple acoustic perception with large language models (LLMs) to extract and understand diverse information from audio, have attracted intense interest from both academic and industrial communities. However, existing LALMs are highly sensitive to how instructions are phrased, affecting both (i) instruction-following rates and (ii) task performance. Yet, no existing benchmarks offer a systematic and comprehensive evaluation of this sensitivity. We introduce ISA-Bench, a dynamic benchmark evaluating instruction sensitivity for LALMs along three axes: instruction description, output format, and task composition. We assess recent open-source and proprietary LALMs using ISA-Bench, profiling both compliance and accuracy under controlled instruction variations. Experimental results reveal that even state-of-the-art LALMs suffer significant instruction sensitivity, leading to degraded performance on fundamental audio understanding tasks. To mitigate this issue, we fine-tune Qwen2-Audio on a specifically constructed complex instruction-variant dataset, achieving a marked improvement in instruction-following performance. However, this also induces nontrivial catastrophic forgetting: the model loses some previously mastered task capabilities when exposed to new instruction styles. Our benchmark provides a standardized basis for assessing and improving instruction sensitivity in LALMs, underscoring the need for instruction-robust audio understanding in real-world pipelines.
Problem

Research questions and friction points this paper is trying to address.

Evaluating instruction sensitivity in large audio language models
Assessing compliance and accuracy under instruction variations
Mitigating performance degradation from instruction phrasing differences
Innovation

Methods, ideas, or system contributions that make the work stand out.

Benchmark evaluates instruction sensitivity along three axes
Fine-tunes model on complex instruction-variant dataset
Reveals catastrophic forgetting induced by complex-instruction fine-tuning
Bohan Li
X-LANCE Lab, School of Computer Science, Shanghai Jiao Tong University, China
Wenbin Huang
X-LANCE Lab, School of Computer Science, Shanghai Jiao Tong University, China
Yuhang Qiu
X-LANCE Lab, School of Computer Science, Shanghai Jiao Tong University, China
Yiwei Guo
X-LANCE Lab, School of Computer Science, Shanghai Jiao Tong University, China
Hankun Wang
Shanghai Jiao Tong University
Zhihan Li
Kuaishou Technology, Tsinghua University
Jing Peng
X-LANCE Lab, School of Computer Science, Shanghai Jiao Tong University, China
Ziyang Ma
X-LANCE Lab, School of Computer Science, Shanghai Jiao Tong University, China
Xie Chen
X-LANCE Lab, School of Computer Science, Shanghai Jiao Tong University, China
Kai Yu
X-LANCE Lab, School of Computer Science, Shanghai Jiao Tong University, China