🤖 AI Summary
This study investigates systemic biases in the Whisper model for Scottish dialect speech recognition, addressing accent-based inequities that hinder equitable access to UK public services. Methodologically, we construct the first public-service-oriented, multi-dialect Scottish speech dataset; perform supervised fine-tuning on Whisper large-v3; and design cross-regional accent contrast experiments complemented by manual error analysis. Key contributions include: (1) the first systematic characterization of Whisper's performance degradation in bilingual/multi-dialect Scottish contexts; (2) empirical validation of the fine-tuned model's cross-regional accent transfer capability; (3) identification of the limitations of Word Error Rate (WER) for dialectal evaluation, particularly its insensitivity to dialect-specific errors; and (4) significant WER reduction on target dialects after fine-tuning, with marked alleviation of dialect-specific errors, including morphological misrecognitions and code-mixed transcription failures.
📝 Abstract
We collect novel data in the public service domain to evaluate the capability of state-of-the-art automatic speech recognition (ASR) models to capture regional differences in accents in the United Kingdom (UK), focusing specifically on two accents from Scotland with distinct dialects. This study addresses real-world problems in which biased ASR models can lead to miscommunication in public services, disadvantaging individuals with regional accents, particularly those in vulnerable populations. We first examine the out-of-the-box performance of the Whisper large-v3 model on a baseline dataset and on our data. We then explore the impact of fine-tuning Whisper on performance in the two UK regions and investigate the effectiveness of existing model evaluation techniques for our real-world application through manual inspection of model errors. We observe that the Whisper model has a higher word error rate (WER) on our test datasets than on the baseline data, and that fine-tuning on a given dataset improves performance on the test dataset with the same domain and accent. The fine-tuned models also appear to perform better on test data from outside the region they were trained on, suggesting that fine-tuned models may be transferable within parts of the UK. Our manual analysis of model outputs reveals the benefits and drawbacks of using WER as an evaluation metric and of fine-tuning to adapt to regional dialects.
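Because the abstract relies on WER as its main metric, a minimal sketch of how WER is commonly computed may help: the word-level edit distance (substitutions, deletions, insertions) divided by the number of reference words. The function name `word_error_rate`, the lowercase-and-split normalization, and the example sentences are illustrative assumptions, not the study's evaluation pipeline or data.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    # Assumed normalization: lowercase and split on whitespace only.
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,                      # deletion
                curr[j - 1] + 1,                  # insertion
                prev[j - 1] + substitution_cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical sentences, invented for illustration only.
reference = "i dinnae ken where the office is"
standardized = "i did not ken where the office is"  # dialect word rewritten
wrong_tense = "i dinnae ken where the office was"   # non-dialect error

print(f"standardized: {word_error_rate(reference, standardized):.3f}")
print(f"wrong tense:  {word_error_rate(reference, wrong_tense):.3f}")
```

In this sketch the standardized transcript costs two edits out of seven reference words and the tense error costs one, yet the score alone does not say which error involves a dialect word. That is one way to see why the abstract pairs WER with manual inspection of model errors.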