🤖 AI Summary
This study examines potential recognition bias in YouTube’s Spanish automatic speech recognition (ASR) captioning, which offers a single option for all Spanish speakers and may therefore perform unevenly across dialects and genders. The work evaluates caption quality for female and male speakers of multiple Spanish dialects, combining ASR accuracy metrics with dialect classification and gender annotation. The findings reveal significantly higher error rates for specific regional varieties, particularly Caribbean and Central American dialects, and for female speakers, pointing to systematic bias at the intersection of linguistic variation and gender. These results underscore the limits of a one-size-fits-all ASR system in accommodating sociolinguistic diversity and provide empirical evidence to inform the development of more equitable, fairness-aware speech recognition systems.
📝 Abstract
Spanish is the official language of twenty-one countries and is spoken by over 441 million people. Naturally, there are many variations in how Spanish is spoken across these countries. Media platforms such as YouTube rely on automatic speech recognition systems to make their content accessible to different groups of users. However, YouTube offers only one option for automatically generating captions in Spanish. This raises the question: could this captioning system be biased against certain Spanish dialects? This study examines the potential biases in YouTube's automatic captioning system by analyzing its performance across various Spanish dialects. By comparing the quality of captions for female and male speakers from different regions, we identify systematic disparities which can be attributed to specific dialects. Our study provides further evidence that algorithmic technologies deployed on digital platforms need to be calibrated to the diverse needs and experiences of their user populations.
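The abstract does not say which metric is used to compare caption quality. As a rough illustration only, the sketch below assumes word error rate (WER), a common ASR accuracy measure, and averages it per dialect and gender group. The function names, the tuple layout, and the group labels are hypothetical and are not taken from the study.

```python
from collections import defaultdict


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # Levenshtein distance over words, keeping only the previous row.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from caption
            insertion = curr[j - 1] + 1  # extra word in caption
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


def mean_wer_by_group(samples):
    """Average WER per (dialect, gender) group.

    `samples` is an iterable of (dialect, gender, reference, caption) tuples;
    this layout is an assumption made for the example.
    """
    scores = defaultdict(list)
    for dialect, gender, reference, caption in samples:
        scores[(dialect, gender)].append(word_error_rate(reference, caption))
    return {group: sum(vals) / len(vals) for group, vals in scores.items()}


if __name__ == "__main__":
    # Toy data with placeholder labels, not results from the paper.
    toy_samples = [
        ("dialect_a", "female", "el gato come pescado", "el pato come"),
        ("dialect_a", "male", "el gato come pescado", "el gato come pescado"),
    ]
    for group, wer in mean_wer_by_group(toy_samples).items():
        print(group, round(wer, 2))
```

Under this assumption, a consistently higher group average for one dialect or gender would be the kind of systematic disparity the abstract describes. A real analysis would also need text normalization (punctuation, numerals, accents) and a significance test across groups.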