"Sorry, I Didn't Catch That": How Speech Models Miss What Matters Most

📅 2026-02-12

🤖 AI Summary

This study examines why mainstream automatic speech recognition (ASR) systems, despite low word error rates on standard benchmarks, fail on short, high-stakes utterances such as U.S. street names, where mis-transcriptions translate directly into routing errors. The authors evaluate 15 models from OpenAI, Deepgram, Google, and Microsoft on recordings from linguistically diverse U.S. speakers and find an average transcription error rate of 44%. Mis-transcriptions cause errors for all speakers, but the resulting routing distance errors are twice as large for non-English primary speakers as for English primary speakers. To mitigate this, the authors generate diverse pronunciations of named entities with open-source text-to-speech (TTS) models and fine-tune on fewer than 1,000 synthetic samples, which improves street name transcription accuracy for non-English primary speakers by nearly 60% relative to the base models.

📝 Abstract

Despite speech recognition systems achieving low word error rates on standard benchmarks, they often fail on short, high-stakes utterances in real-world deployments. Here, we study this failure mode in a high-stakes task: the transcription of U.S. street names as spoken by U.S. participants. We evaluate 15 models from OpenAI, Deepgram, Google, and Microsoft on recordings from linguistically diverse U.S. speakers and find an average transcription error rate of 44%. We quantify the downstream impact of failed transcriptions by geographic locations and show that mis-transcriptions systematically cause errors for all speakers, but that routing distance errors are twice as large for non-English primary speakers compared to English primary speakers. To mitigate this harm, we introduce a synthetic data generation approach that produces diverse pronunciations of named entities using open-source text-to-speech models. Fine-tuning with less than 1,000 synthetic samples improves street name transcription accuracy by nearly 60% (relative to base models) for non-English primary speakers. Our results highlight a critical gap between benchmark performance and real-world reliability in speech systems and demonstrate a simple, scalable path to reducing high-stakes transcription errors.
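The two headline figures in the abstract, a transcription error rate and a relative accuracy improvement, can be illustrated with a minimal Python sketch. It assumes an utterance-level exact-match metric after light text normalization, and the abstract does not specify the paper's actual scoring rules. The function names, street names, and accuracy values below are hypothetical toy inputs, not data or results from the paper.

```python
import re


def normalize(text: str) -> str:
    """Lowercase, drop punctuation, and collapse whitespace (assumed normalization)."""
    text = re.sub(r"[^a-z0-9\s]", "", text.lower())
    return " ".join(text.split())


def transcription_error_rate(references: list[str], hypotheses: list[str]) -> float:
    """Fraction of utterances whose normalized transcript differs from the reference."""
    if not references or len(references) != len(hypotheses):
        raise ValueError("references and hypotheses must be non-empty and equal in length")
    errors = sum(normalize(r) != normalize(h) for r, h in zip(references, hypotheses))
    return errors / len(references)


def relative_improvement(base_accuracy: float, tuned_accuracy: float) -> float:
    """Accuracy gain expressed as a fraction of the base model's accuracy."""
    if base_accuracy <= 0:
        raise ValueError("base_accuracy must be positive")
    return (tuned_accuracy - base_accuracy) / base_accuracy


# Hypothetical toy transcripts: two of the four street names are mis-transcribed.
references = ["Valencia Street", "Gough Street", "Cesar Chavez Street", "Noe Street"]
hypotheses = ["valencia street.", "Goff Street", "Caesar Chavez Street", "Noe Street"]

error_rate = transcription_error_rate(references, hypotheses)
print(f"error rate: {error_rate:.0%}")

# Hypothetical accuracies chosen only to show how a relative gain is computed.
gain = relative_improvement(base_accuracy=0.50, tuned_accuracy=0.80)
print(f"relative improvement: {gain:.0%}")
```

Note that a relative improvement is measured against the base model's accuracy, so in this toy case a 30-point absolute gain from 50% to 80% is a 60% relative gain.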

Problem

Research questions and friction points this paper is trying to address.

speech recognition

transcription errors

high-stakes utterances

named entity pronunciation

linguistic diversity

Innovation

Methods, ideas, or system contributions that make the work stand out.

synthetic data generation

speech recognition fairness

high-stakes transcription