🤖 AI Summary
This work addresses the practical robustness limitations of automatic speech recognition (ASR) in human–robot interaction (HRI), systematically evaluating six real-world challenges: domain-specific terminology, speaker accents, environmental noise, age-related vocal variation, speech disorders, and spontaneous speech. We conduct a cross-dimensional evaluation of four state-of-the-art ASR models across eight public benchmarks. Despite comparable accuracy on standard benchmarks, the models diverge substantially in realistic HRI settings, exhibiting high error rates, semantic hallucinations, and sociolinguistic biases. These deficiencies critically impair task success, user trust, and interaction safety. To address this gap, we introduce the first HRI-oriented multidimensional robustness evaluation framework, whose analysis uncovers critical weaknesses of current speech foundation models under hardware constraints and noisy conditions. The study provides empirical evidence and a new assessment paradigm to advance trustworthy ASR deployment in embodied interactive systems.
📝 Abstract
Automatic Speech Recognition (ASR) systems in real-world settings must handle imperfect audio, often degraded by hardware limitations or environmental noise, while accommodating diverse user groups. In human–robot interaction (HRI), these challenges intersect to create a uniquely demanding recognition environment. We evaluate four state-of-the-art ASR systems on eight publicly available datasets spanning six dimensions of difficulty: domain-specific, accented, noisy, age-variant, impaired, and spontaneous speech. Our analysis reveals significant variations in performance, hallucination tendencies, and inherent biases, despite similar scores on standard benchmarks. These limitations have serious implications for HRI, where recognition errors can undermine task performance, user trust, and safety.
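The evaluations described above are conventionally scored with word error rate (WER), the standard ASR metric: word-level edit distance between reference and hypothesis, divided by reference length. As a minimal stdlib-only sketch (whitespace tokenization, no text normalization; the paper's exact scoring pipeline is not specified here):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: (substitutions + deletions + insertions) / reference words.

    Assumes plain whitespace tokenization; real scoring pipelines
    typically also lowercase and strip punctuation first.
    """
    ref, hyp = reference.split(), hypothesis.split()
    # One-row dynamic-programming Levenshtein distance over word sequences.
    row = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, 1):
        prev_diag = row[0]  # d[i-1][j-1]
        row[0] = i          # deleting the first i reference words
        for j, hyp_word in enumerate(hyp, 1):
            prev_above = row[j]  # d[i-1][j], before overwrite
            row[j] = min(
                row[j - 1] + 1,                        # insertion
                prev_above + 1,                        # deletion
                prev_diag + (ref_word != hyp_word),    # substitution / match
            )
            prev_diag = prev_above
    return row[len(hyp)] / max(len(ref), 1)


# Example: one deleted word out of four -> WER 0.25
print(wer("turn on the light", "turn on light"))  # 0.25
```

Hallucination analysis, also mentioned above, goes beyond WER: a transcript can score moderate WER while inserting fluent content absent from the audio, which is why the study reports semantic errors separately.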