AI Summary
This study addresses the gap between automatic speech recognition (ASR) and human auditory cognition in spoken dialogue systems (SDSs), specifically investigating how humans perform selective listening during dialogue and which recognition capabilities ASR must acquire to approach human performance. Using experimental psychology paradigms, we quantify human information-selection preferences in natural conversations via manual transcription analysis, dialogue response generation tasks, and attention pattern modeling. We propose a novel, cognition-grounded ASR evaluation framework, the first to operationalize selective listening as measurable cognitive metrics. Experiments reveal that humans consistently ignore redundant acoustic segments and prioritize semantically critical units (e.g., intent verbs, entity nouns), whereas state-of-the-art ASR systems exhibit systematic deficits in capturing such units. Our work establishes a cognitively informed benchmark for ASR evaluation, advancing the field from lexical accuracy toward semantic relevance.
Abstract
Spoken dialogue systems (SDSs) use automatic speech recognition (ASR) at the front end of their pipeline. The role of ASR in an SDS is to appropriately recognize the information in user speech that is relevant to response generation. Examining human selective listening, the ability to focus on and attend to the important parts of a conversation as it unfolds, enables us to identify the ASR capabilities required for SDSs and to evaluate them. In this study, we experimentally confirmed selective listening during dialogue response generation by comparing the transcriptions humans produce when generating dialogue responses with reference transcriptions. Based on our experimental results, we discuss the possibility of a new ASR evaluation method that leverages human selective listening and can identify the gap in transcription ability between ASR systems and humans.
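The comparison described above can be illustrated with a minimal sketch: treat the words a human transcriber kept (relative to the reference transcript) as the "selected" units, and score an ASR hypothesis only on those units. This is not the paper's actual metric; all names and the bag-of-words matching below are illustrative assumptions, and a real implementation would use proper alignment rather than set membership.

```python
# Sketch (assumption): estimate which reference words a human retained
# while listening, then measure how many of those retained words an ASR
# hypothesis missed. Function names and the toy data are hypothetical.

def retained_words(reference: str, human: str) -> list[str]:
    """Words of the reference transcript the human transcriber kept."""
    human_set = set(human.lower().split())
    return [w for w in reference.lower().split() if w in human_set]

def selective_error_rate(reference: str, human: str, hypothesis: str) -> float:
    """Fraction of human-retained words that are absent from the ASR output."""
    kept = retained_words(reference, human)
    if not kept:
        return 0.0
    hyp_set = set(hypothesis.lower().split())
    missed = [w for w in kept if w not in hyp_set]
    return len(missed) / len(kept)

reference  = "well um i would like to book a flight to boston tomorrow"
human      = "book a flight to boston tomorrow"          # fillers dropped
hypothesis = "i would like to book a flight to austin tomorrow"

print(f"{selective_error_rate(reference, human, hypothesis):.3f}")
```

In this toy example the hypothesis misrecognizes one retained content word ("boston"), so the selective error rate penalizes only that miss, while the dropped fillers ("well", "um") do not affect the score, in contrast to plain word error rate, which would count them.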