Pardon? Evaluating Conversational Repair in Large Audio-Language Models

📅 2026-01-19
📈 Citations: 0
✨ Influential: 0
📄 PDF
🤖 AI Summary
Current evaluations of large audio-language models predominantly emphasize answer accuracy while overlooking the models' ability to recover from semantically unanswerable inputs, thus failing to reflect real-world interaction reliability. This work proposes the first repair-aware evaluation framework, which constructs paired answerable and unanswerable inputs through a semantic-acoustic masking protocol and introduces a non-compensatory Evaluability Awareness and Repair (EAR) metric to jointly assess both task performance and repair awareness. Experiments on two spoken question-answering benchmarks reveal that prevailing models generally lack repair awareness, underscoring the critical role of repair capability in system reliability and challenging the traditional accuracy-centric evaluation paradigm.
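
Neither the summary nor the abstract pins down the mechanics of the masking protocol. As a hedged illustration only, the sketch below assumes the protocol silences the time span carrying the answer-critical content, so each answerable clip gets an acoustically near-identical but semantically unanswerable twin; the function name and the source of the span boundaries are assumptions, not details from the paper.

```python
import numpy as np

def mask_answer_span(waveform, sr, start_s, end_s):
    """Illustrative masking step (assumed, not the paper's exact protocol):
    silence the audio span carrying the answer-critical content, turning an
    answerable clip into a semantically unanswerable but acoustically
    near-identical counterpart. Span boundaries are assumed to come from a
    word-level alignment of the answer-critical phrase."""
    masked = waveform.copy()
    masked[int(start_s * sr):int(end_s * sr)] = 0.0
    return masked

# Paired evaluation condition: the original clip stays answerable,
# the masked twin is unanswerable.
sr = 16_000
clip = np.random.randn(3 * sr).astype(np.float32)  # stand-in 3 s clip
unanswerable = mask_answer_span(clip, sr, start_s=1.0, end_s=1.8)
```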

๐Ÿ“ Abstract
Large Audio-Language Models (LALMs) have demonstrated strong performance in spoken question answering (QA), with existing evaluations primarily focusing on answer accuracy and robustness to acoustic perturbations. However, such evaluations implicitly assume that spoken inputs remain semantically answerable, an assumption that often fails in real-world interaction when essential information is missing. In this work, we introduce a repair-aware evaluation setting that explicitly distinguishes between answerable and unanswerable audio inputs. We define answerability as a property of the input itself and construct paired evaluation conditions using a semantic-acoustic masking protocol. Based on this setting, we propose the Evaluability Awareness and Repair (EAR) score, a non-compensatory metric that jointly evaluates task competence under answerable conditions and repair behavior under unanswerable conditions. Experiments on two spoken QA benchmarks across diverse LALMs reveal a consistent gap between answer accuracy and conversational reliability: while many models perform well when inputs are answerable, most fail to recognize semantic unanswerability and initiate appropriate conversational repair. These findings expose a limitation of prevailing accuracy-centric evaluation practices and motivate reliability assessments that treat unanswerable inputs as cues for repair and continued interaction.
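
The abstract describes the EAR score as non-compensatory without giving its functional form. A minimal sketch, assuming the score takes the minimum of answer accuracy on answerable inputs and repair rate on unanswerable inputs, so that strength on one axis cannot mask failure on the other:

```python
def ear_score(answerable_correct, repair_initiated):
    """Hypothetical non-compensatory EAR combination (the paper's exact
    formula is not given here). answerable_correct: per-item booleans for
    correct answers on answerable inputs; repair_initiated: per-item
    booleans for whether the model flagged an unanswerable input and
    initiated repair (e.g. asked a clarifying question) instead of
    answering anyway."""
    accuracy = sum(answerable_correct) / len(answerable_correct)
    repair_rate = sum(repair_initiated) / len(repair_initiated)
    # Non-compensatory: the weaker axis bounds the score.
    return min(accuracy, repair_rate)

# A model with 90% answer accuracy but only 10% repair awareness scores
# 0.1, not the 0.5 a compensatory average would report.
print(ear_score([True] * 9 + [False], [True] + [False] * 9))
```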
Problem

Research questions and friction points this paper is trying to address.

conversational repair
answerability
audio-language models
spoken question answering
evaluation reliability
Innovation

Methods, ideas, or system contributions that make the work stand out.

conversational repair
answerability
audio-language models
evaluation metric
semantic masking
Shuanghong Huang
Beijing Institute of Technology
Jinlei Xu
Beijing Institute of Technology
Youchao Zhou
Beijing Institute of Technology
Yanghao Zhou
Beijing Institute of Technology
Xuan Zhao
PhD, Forschungszentrum Jülich GmbH
XAI, Fair AI
Chong Feng
Beijing Institute of Technology
Wenxuan Zhang
Singapore University of Technology and Design
Natural Language Processing, Large Language Models, Multilingual NLP