🤖 AI Summary
This study addresses the poor automatic speech recognition (ASR) performance for low-resource Frisian and its regional dialects (Clay Frisian, Wood Frisian, and South Frisian), and identifies the data elicitation method as a critical factor in model robustness, showing that evaluating only on standard Frisian gives an overly optimistic picture of real-world performance. To address this, the authors propose a multilingual fine-tuning approach that jointly optimizes ASR for Frisian, Dutch, English, and German, augmented with an auxiliary language identification task, improving on monolingual fine-tuning. Building on the self-supervised wav2vec 2.0 and XLS-R models, they systematically compare dialect-specific speech elicitation strategies and quantify substantial performance gaps: dialectal WER exceeds standard-Frisian WER by 30–50%, with the size of the gap modulated by the elicitation method. The proposed approach reduces average dialectal WER by 12.4% and improves cross-dialect generalization.
📝 Abstract
Automatic Speech Recognition (ASR) performance for low-resource languages still lags far behind that of higher-resource languages such as English, due to a lack of sufficient labeled data. State-of-the-art methods employ self-supervised transfer learning, in which a model pre-trained on large amounts of unlabeled data is fine-tuned on a small amount of labeled data in a target low-resource language. In this paper, we present and examine a method for fine-tuning an SSL-based model to improve performance for Frisian and its regional dialects (Clay Frisian, Wood Frisian, and South Frisian). We show that Frisian ASR performance can be improved by using multilingual (Frisian, Dutch, English, and German) fine-tuning data and an auxiliary language identification task. In addition, our findings show that performance on dialectal speech suffers substantially, and, importantly, that this effect is moderated by the elicitation approach used to collect the dialectal data. Our findings further suggest that relying solely on standard language data for ASR evaluation may paint an overly optimistic picture of real-world performance, particularly in languages with substantial dialectal variation.
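The fine-tuning objective described above, ASR training combined with an auxiliary language identification (LID) task, can be sketched as a weighted multitask loss: a CTC term over character targets plus a cross-entropy term over the four fine-tuning languages. This is a minimal illustrative sketch in PyTorch, not the authors' code; the vocabulary size, batch shapes, and auxiliary weight are assumptions.

```python
# Hypothetical sketch of a joint CTC + auxiliary-LID objective.
# Shapes, vocabulary size, and the auxiliary weight are illustrative
# assumptions, not the paper's actual configuration.
import torch
import torch.nn as nn

torch.manual_seed(0)

NUM_CHARS = 32   # assumed character vocabulary size (index 0 = CTC blank)
NUM_LANGS = 4    # Frisian, Dutch, English, German
T, B, S = 50, 2, 20  # encoder frames, batch size, target length

# Stand-in encoder outputs (e.g. from a fine-tuned wav2vec 2.0 / XLS-R
# encoder): per-frame log-probabilities over the character vocabulary.
log_probs = torch.randn(T, B, NUM_CHARS).log_softmax(dim=-1)
targets = torch.randint(1, NUM_CHARS, (B, S))   # character targets (no blanks)
input_lengths = torch.full((B,), T)
target_lengths = torch.full((B,), S)

# Auxiliary LID head: one utterance-level logit vector per example.
lid_logits = torch.randn(B, NUM_LANGS)
lid_labels = torch.randint(0, NUM_LANGS, (B,))

ctc_loss = nn.CTCLoss(blank=0)(log_probs, targets, input_lengths, target_lengths)
lid_loss = nn.CrossEntropyLoss()(lid_logits, lid_labels)

AUX_WEIGHT = 0.1  # assumed weighting of the auxiliary task
total_loss = ctc_loss + AUX_WEIGHT * lid_loss
```

In practice both terms would be backpropagated through a shared encoder, so the LID gradient regularizes the representations used for multilingual ASR.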