🤖 AI Summary
This study addresses automatic word-level pronunciation assessment for children learning Norwegian as a second language, carried out for the NOCASA 2025 Challenge. We analyze three end-to-end models: an encoder-decoder Siamese architecture (E2E-R), a prefix-tuned direct classification model built on pretrained wav2vec 2.0 representations, and a novel model that integrates alignment-free goodness-of-pronunciation (GOP) features computed via Connectionist Temporal Classification (CTC). We further introduce a weighted ordinal cross-entropy loss tailored to metrics such as unweighted average recall (UAR) and mean absolute error (MAE). Of the three models, the GOP-CTC-based one performs best, substantially surpassing the challenge baselines and attaining top leaderboard scores.
📝 Abstract
This paper presents an analysis of three end-to-end models developed for the NOCASA 2025 Challenge, which targets automatic word-level pronunciation assessment for children learning Norwegian as a second language. Our models include an encoder-decoder Siamese architecture (E2E-R), a prefix-tuned direct classification model leveraging pretrained wav2vec 2.0 representations, and a novel model integrating alignment-free goodness-of-pronunciation (GOP) features computed via CTC. We introduce a weighted ordinal cross-entropy loss tailored for optimizing metrics such as unweighted average recall and mean absolute error. Among the explored methods, our GOP-CTC-based model achieved the highest performance, substantially surpassing challenge baselines and attaining top leaderboard scores.
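The abstract does not give the form of the weighted ordinal cross-entropy loss, so the following is only a minimal sketch of one plausible reading, not the authors' formulation. It assumes two ingredients: a per-class weight (for example, inverse class frequency) that rebalances rare scores, which relates to unweighted average recall, and a soft target whose mass decays with distance from the true score, which relates to mean absolute error. The function names, the five-class scale, and the temperature `tau` are all hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def ordinal_soft_target(target, num_classes, tau=1.0):
    """Assumed ordinal target: mass decays exponentially with |k - target|."""
    raw = [math.exp(-abs(k - target) / tau) for k in range(num_classes)]
    total = sum(raw)
    return [r / total for r in raw]


def weighted_ordinal_cross_entropy(logits, target, class_weights, tau=1.0):
    """Cross-entropy against the soft ordinal target, scaled by a class weight.

    This is an illustrative form only; the paper's exact loss may differ.
    """
    probs = softmax(logits)
    soft = ordinal_soft_target(target, len(logits), tau)
    ce = -sum(q * math.log(p) for q, p in zip(soft, probs))
    return class_weights[target] * ce


# Hypothetical five-class score scale, indexed 0 to 4, with uniform weights.
weights = [1.0] * 5
target = 4
logits_near = [0.0, 0.0, 0.0, 3.0, 0.0]  # confident in class 3, one step off
logits_far = [3.0, 0.0, 0.0, 0.0, 0.0]   # confident in class 0, four steps off

near = weighted_ordinal_cross_entropy(logits_near, target, weights)
far = weighted_ordinal_cross_entropy(logits_far, target, weights)
print(near < far)
```

Both example predictions assign the same probability to the true class, so a plain cross-entropy would score them identically. Under the assumed soft target, the prediction that is one step away from the true score incurs a smaller loss than the one that is four steps away.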