🤖 AI Summary
To address the scarcity of mispronunciation detection (MD) models for low-resource language varieties such as Finland Swedish, this paper proposes a framework with minimal dependence on second-language (L2) data: the model is trained on 89 hours of first-language (L1) spontaneous speech and evaluated on only 33 minutes of transcribed L2 read-aloud speech, with no L2 pronunciation-error annotations required. The method fine-tunes a multilingual wav2vec 2.0 model with entropy-regularized training and applies temperature scaling and top-k normalization as post-processing after inference, making the pipeline language-independent and transferable to other low-resource languages. Its core contribution lies in its simplicity and its minimal reliance on annotated L2 data. On the Finland Swedish test set, the approach achieves 43.2% recall at 29.8% precision, improving precision by 12.2 percentage points over the baseline (77.5% recall, 17.6% precision) and yielding a better precision–recall balance.
📝 Abstract
Mispronunciation detection (MD) models are the cornerstones of many language learning applications. Unfortunately, most systems are built for English and other major languages, while low-resourced language varieties, such as Finland Swedish (FS), lack such tools. In this paper, we introduce our MD model for FS, trained on 89 hours of first language (L1) speakers' spontaneous speech and tested on 33 minutes of L2 transcribed read-aloud speech. We trained a multilingual wav2vec 2.0 model with entropy regularization, followed by temperature scaling and top-k normalization after inference, to better adapt it for MD. The main novelty of our method lies in its simplicity, requiring minimal L2 data. The process is also language-independent, making it suitable for other low-resource languages. Our proposed algorithm allows us to balance Recall (43.2%) and Precision (29.8%), compared with the baseline model's Recall (77.5%) and Precision (17.6%).
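The post-processing described in the abstract (temperature scaling of the model's outputs followed by top-k normalization of the resulting posteriors) might look roughly like the sketch below. The function names, array shapes, and parameter values here are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def temperature_scale(logits, T=2.0):
    """Soften frame-level logits with temperature T, then apply softmax.

    T > 1 flattens the distribution, which can make a model's phone
    posteriors less overconfident before thresholding for MD.
    """
    z = logits / T
    z = z - z.max(axis=-1, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def top_k_normalize(probs, k=3):
    """Keep only the k largest posteriors per frame and renormalize to 1."""
    # Indices of everything *outside* the top k (argsort is ascending).
    drop = np.argsort(probs, axis=-1)[..., :-k]
    out = probs.copy()
    np.put_along_axis(out, drop, 0.0, axis=-1)
    return out / out.sum(axis=-1, keepdims=True)

# Toy example: one frame of logits over a 6-phone inventory.
logits = np.array([[4.0, 2.0, 1.0, 0.5, 0.2, 0.1]])
probs = temperature_scale(logits, T=2.0)
renorm = top_k_normalize(probs, k=3)
```

A mispronunciation could then be flagged when the renormalized posterior of the canonical (expected) phone falls below a tuned threshold; the thresholding step is an assumption here, as the abstract does not spell out the decision rule.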