AI Summary
This study addresses the gap between automatic speech recognition (ASR) and human auditory cognition in spoken dialogue systems (SDSs), specifically investigating how humans perform selective listening during dialogue and which recognition capabilities ASR must acquire to approach human performance. Using experimental psychology paradigms, we quantify human information-selection preferences in natural conversations via manual transcription analysis, dialogue response generation tasks, and attention pattern modeling. We propose a novel, cognition-grounded ASR evaluation framework, the first to operationalize selective listening as measurable cognitive metrics. Experiments reveal that humans consistently ignore redundant acoustic segments and prioritize semantically critical units (e.g., intent verbs, entity nouns), whereas state-of-the-art ASR systems exhibit systematic deficits in capturing such units. Our work establishes a cognitively informed benchmark for ASR evaluation, advancing the field from lexical accuracy toward semantic relevance.
Abstract
Spoken dialogue systems (SDSs) use automatic speech recognition (ASR) at the front end of their pipeline. The role of ASR in an SDS is to appropriately recognize the information in user speech that is relevant to response generation. Examining human selective listening, the ability to focus on and attend to the important parts of a conversation as it unfolds, enables us to identify the ASR capabilities required for SDSs and to evaluate them. In this study, we experimentally confirmed selective listening during dialogue response generation by comparing the transcriptions humans produce when generating dialogue responses with reference transcriptions. Based on our experimental results, we discuss the possibility of a new ASR evaluation method that leverages human selective listening and can identify the gap in transcription ability between ASR systems and humans.
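The comparison described above can be illustrated with a minimal sketch: treat the words a human transcriber kept (relative to the reference transcript) as the "selected" units, and score an ASR hypothesis only on those units. This is not the paper's actual metric; all names and the bag-of-words matching below are illustrative assumptions, and a real implementation would use proper alignment rather than set membership.

```python
# Sketch (assumption): estimate which reference words a human retained
# while listening, then measure how many of those retained words an ASR
# hypothesis missed. Function names and the toy data are hypothetical.

def retained_words(reference: str, human: str) -> list[str]:
    """Words of the reference transcript the human transcriber kept."""
    human_set = set(human.lower().split())
    return [w for w in reference.lower().split() if w in human_set]

def selective_error_rate(reference: str, human: str, hypothesis: str) -> float:
    """Fraction of human-retained words that are absent from the ASR output."""
    kept = retained_words(reference, human)
    if not kept:
        return 0.0
    hyp_set = set(hypothesis.lower().split())
    missed = [w for w in kept if w not in hyp_set]
    return len(missed) / len(kept)

reference  = "well um i would like to book a flight to boston tomorrow"
human      = "book a flight to boston tomorrow"          # fillers dropped
hypothesis = "i would like to book a flight to austin tomorrow"

print(f"{selective_error_rate(reference, human, hypothesis):.3f}")
```

In this toy example the hypothesis misrecognizes one retained content word ("boston"), so the selective error rate penalizes only that miss, while the dropped fillers ("well", "um") do not affect the score, in contrast to plain word error rate, which would count them.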