What Do Humans Hear When Interacting? Experiments on Selective Listening for Evaluating ASR of Spoken Dialogue Systems

📅 2025-08-06
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
This study addresses the gap between automatic speech recognition (ASR) and human auditory cognition in spoken dialogue systems (SDSs), specifically investigating how humans perform selective listening during dialogue and what recognition capabilities ASR must acquire to approach human performance. Using experimental psychology paradigms, we quantify human information-selection preferences in natural conversations via manual transcription analysis, dialogue response generation tasks, and attention pattern modeling. We propose a novel, cognition-grounded ASR evaluation framework, the first to operationalize selective listening as measurable cognitive metrics. Experiments reveal that humans consistently ignore redundant acoustic segments and prioritize semantically critical units (e.g., intent verbs, entity nouns), whereas state-of-the-art ASR systems exhibit systematic deficits in capturing such units. Our work establishes a cognitively informed benchmark for ASR evaluation, advancing the field from lexical accuracy toward semantic relevance.

๐Ÿ“ Abstract
Spoken dialogue systems (SDSs) use automatic speech recognition (ASR) at the front end of their pipeline. The role of ASR in SDSs is to appropriately recognize the information in user speech that is relevant to response generation. Examining human selective listening, the ability to focus on and attend to the important parts of a conversation during speech, will enable us to identify the ASR capabilities required for SDSs and to evaluate them. In this study, we experimentally confirmed selective listening when humans generate dialogue responses by comparing human transcriptions produced for generating dialogue responses against reference transcriptions. Based on our experimental results, we discuss the possibility of a new ASR evaluation method that leverages human selective listening and can identify the gap in transcription ability between ASR systems and humans.
Problem

Research questions and friction points this paper addresses.

Investigates human selective listening in dialogue interactions
Compares human and ASR transcription for response generation
Proposes new ASR evaluation method using human listening patterns
Innovation

Methods, ideas, or system contributions that make the work stand out.

Human selective listening for ASR evaluation
Comparison of human and reference transcriptions
New ASR evaluation method leveraging human focus
Kiyotada Mori
Nara Institute of Science and Technology, Japan; Guardian Robot Project, RIKEN, Japan
Seiya Kawano
Guardian Robot Project, RIKEN, Japan; Nara Institute of Science and Technology, Japan
Chaoran Liu
Guardian Robot Project, RIKEN, Japan; National Institute of Informatics, Japan
Carlos Toshinori Ishi
Guardian Robot Project, RIKEN, Japan
Angel Fernando Garcia Contreras
Guardian Robot Project, RIKEN, Japan
Koichiro Yoshino
Tokyo Institute of Technology / GRP, RIKEN
spoken dialogue systems, natural language processing, spoken language processing, human robot