🤖 AI Summary
Existing audio question-answering benchmarks focus almost exclusively on answerable questions, overlooking common real-world cases where a query cannot be answered: the correct answer is missing from the options, the answer choices are categorically mismatched with the question, or the question is unrelated to the audio or lacks evidential support in it. This gap compromises model reliability in practical applications. To address it, this work proposes AQUA-Bench, the first systematic evaluation benchmark designed specifically for unanswerable questions in audio QA. It constructs evaluation tasks covering three distinct refusal scenarios and quantitatively assesses a model's ability to abstain from answering based on audio-language multimodal understanding. Experimental results show that while current models perform well on standard answerable tasks, they exhibit significant blind spots when confronted with unanswerable questions, underscoring the role of AQUA-Bench in advancing robust and trustworthy audio-language systems.
📝 Abstract
Recent advances in audio-aware large language models have shown strong performance on audio question answering. However, existing benchmarks mainly cover answerable questions and overlook the challenge of unanswerable ones, where no reliable answer can be inferred from the audio. Such cases are common in real-world settings, where questions may be misleading, ill-posed, or incompatible with the information the audio provides. To address this gap, we present AQUA-Bench, a benchmark for Audio Question Unanswerability Assessment. It systematically evaluates three scenarios: Absent Answer Detection (the correct option is missing), Incompatible Answer Set Detection (the choices are categorically mismatched with the question), and Incompatible Audio Question Detection (the question is irrelevant to, or lacks sufficient grounding in, the audio). By assessing these cases, AQUA-Bench offers a rigorous measure of model reliability and promotes the development of audio-language systems that are more robust and trustworthy. Our experiments suggest that while models excel on standard answerable tasks, they often struggle with unanswerable ones, pointing to a blind spot in current audio-language understanding.
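The abstract describes measuring whether a model abstains on unanswerable items across the three scenarios. A minimal sketch of how such per-scenario abstention accuracy could be computed is shown below; the `Item` class, the scenario labels, and the `"abstain"` prediction token are illustrative assumptions, not AQUA-Bench's actual data format or API.

```python
# Hypothetical sketch: per-scenario abstention accuracy on unanswerable items.
# All names here (Item, scenario strings, "abstain") are assumptions for
# illustration, not the benchmark's real schema.
from dataclasses import dataclass

@dataclass
class Item:
    question: str
    scenario: str     # e.g. "absent_answer", "incompatible_set", "incompatible_audio"
    answerable: bool  # False for the unanswerable cases the benchmark targets

def abstention_accuracy(predictions, items):
    """For each scenario, return the fraction of unanswerable items on which
    the model chose the abstain option instead of a concrete answer."""
    totals, correct = {}, {}
    for pred, item in zip(predictions, items):
        if item.answerable:
            continue  # answerable items are scored separately (standard accuracy)
        totals[item.scenario] = totals.get(item.scenario, 0) + 1
        if pred == "abstain":
            correct[item.scenario] = correct.get(item.scenario, 0) + 1
    return {s: correct.get(s, 0) / totals[s] for s in totals}
```

A model that answers fluently on answerable items but never emits the abstain option would score highly on standard accuracy yet near zero here, which is exactly the blind spot the benchmark is designed to expose.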