🤖 AI Summary
This study systematically evaluates racial bias in four major commercial automatic speech recognition (ASR) systems. Using the Pacific Northwest English (PNWE) corpus, which features speakers from African American, Caucasian American, ChicanX, and Yakama communities, the authors quantify cross-ethnic transcription disparities via a heuristically determined Phonetic Error Rate (PER) metric linked to sociophonetic annotations. Results reveal significantly higher PER for African American speakers across all systems; critically, all models handle dialectal phenomena such as resistance to the low-back merger and pre-nasal merger patterns poorly, indicating that inadequate acoustic modeling of sociophonetic variation is a primary source of bias. The study's analytical framework ties PER to eleven fine-grained sociophonetic features, identifying vowel quality variation as a key determinant of performance disparity. These findings underscore the need to incorporate dialectal diversity into ASR training and evaluation to advance fairness and robustness in speech technology.
📝 Abstract
This paper presents a systematic evaluation of racial bias in four major commercial automatic speech recognition (ASR) systems using the Pacific Northwest English (PNWE) corpus. We analyze transcription accuracy across speakers from four ethnic backgrounds (African American, Caucasian American, ChicanX, and Yakama) and examine how sociophonetic variation contributes to differential system performance. We introduce a heuristically determined Phonetic Error Rate (PER) metric that links recognition errors to specific linguistically motivated variables derived from sociophonetic annotation. Our analysis of eleven sociophonetic features reveals that vowel quality variation, particularly resistance to the low-back merger and pre-nasal merger patterns, is systematically associated with differential error rates across ethnic groups, with the most pronounced effects for African American speakers across all evaluated systems. These findings demonstrate that acoustic modeling of dialectal phonetic variation, rather than lexical or syntactic factors, remains a primary source of bias in commercial ASR systems. The study establishes the PNWE corpus as a valuable resource for bias evaluation in speech technologies and provides actionable guidance for improving ASR performance through targeted representation of sociophonetic diversity in training data.
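To make the core of the PER metric concrete, here is a minimal sketch of how a phone-level error rate can be computed: Levenshtein edit distance between reference and hypothesis phone sequences, normalized by reference length. This is a generic illustration only; the paper's heuristically determined PER additionally links each error to sociophonetic variables from annotation, which is not reproduced here, and the function name and example phone strings are hypothetical.

```python
def phonetic_error_rate(ref_phones, hyp_phones):
    """PER = (substitutions + deletions + insertions) / len(ref_phones).

    Standard dynamic-programming edit distance over phone sequences;
    a sketch, not the paper's implementation.
    """
    m, n = len(ref_phones), len(hyp_phones)
    # dp[i][j] = minimum edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i  # all deletions
    for j in range(n + 1):
        dp[0][j] = j  # all insertions
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if ref_phones[i - 1] == hyp_phones[j - 1] else 1
            dp[i][j] = min(
                dp[i - 1][j] + 1,        # deletion
                dp[i][j - 1] + 1,        # insertion
                dp[i - 1][j - 1] + cost  # match or substitution
            )
    return dp[m][n] / m if m else 0.0


# Illustrative (hypothetical) case: a recognizer mishandling a speaker's
# low-back vowel yields one substitution out of three reference phones.
ref = ["k", "ɔ", "t"]  # intended "caught"
hyp = ["k", "ɑ", "t"]  # recognized as "cot"
print(round(phonetic_error_rate(ref, hyp), 3))  # → 0.333
```

Because the denominator is the reference length, PER is directly comparable across speaker groups, which is what allows per-group disparities to be quantified.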