🤖 AI Summary
This work addresses the practical robustness limitations of automatic speech recognition (ASR) in human–robot interaction (HRI), systematically evaluating six real-world challenges: domain-specific terminology, speaker accents, environmental noise, age-related vocal variation, speech disorders, and spontaneous speech. We conduct a cross-dimensional evaluation of four state-of-the-art ASR models across eight public benchmarks. Despite comparable accuracy on standard benchmarks, the models diverge substantially in realistic HRI settings, exhibiting high error rates, semantic hallucinations, and sociolinguistic biases. These deficiencies critically impair task success, user trust, and interaction safety. To address this gap, we introduce the first HRI-oriented multidimensional robustness evaluation framework, whose analysis uncovers critical weaknesses of current speech foundation models under hardware constraints and noisy conditions. The study provides empirical evidence and a new assessment paradigm to advance trustworthy ASR deployment in embodied interactive systems.
📝 Abstract
Automatic Speech Recognition (ASR) systems in real-world settings must handle imperfect audio, often degraded by hardware limitations or environmental noise, while accommodating diverse user groups. In human–robot interaction (HRI), these challenges intersect to create a uniquely demanding recognition environment. We evaluate four state-of-the-art ASR systems on eight publicly available datasets spanning six dimensions of difficulty: domain-specific, accented, noisy, age-variant, impaired, and spontaneous speech. Our analysis reveals significant variations in performance, hallucination tendencies, and inherent biases, despite similar scores on standard benchmarks. These limitations have serious implications for HRI, where recognition errors can undermine task performance, user trust, and safety.
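The evaluations described above are conventionally scored with word error rate (WER), the standard ASR metric: word-level edit distance between reference and hypothesis, divided by reference length. As a minimal stdlib-only sketch (whitespace tokenization, no text normalization; the paper's exact scoring pipeline is not specified here):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: (substitutions + deletions + insertions) / reference words.

    Assumes plain whitespace tokenization; real scoring pipelines
    typically also lowercase and strip punctuation first.
    """
    ref, hyp = reference.split(), hypothesis.split()
    # One-row dynamic-programming Levenshtein distance over word sequences.
    row = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, 1):
        prev_diag = row[0]  # d[i-1][j-1]
        row[0] = i          # deleting the first i reference words
        for j, hyp_word in enumerate(hyp, 1):
            prev_above = row[j]  # d[i-1][j], before overwrite
            row[j] = min(
                row[j - 1] + 1,                        # insertion
                prev_above + 1,                        # deletion
                prev_diag + (ref_word != hyp_word),    # substitution / match
            )
            prev_diag = prev_above
    return row[len(hyp)] / max(len(ref), 1)


# Example: one deleted word out of four -> WER 0.25
print(wer("turn on the light", "turn on light"))  # 0.25
```

Hallucination analysis, also mentioned above, goes beyond WER: a transcript can score moderate WER while inserting fluent content absent from the audio, which is why the study reports semantic errors separately.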