Comparison of End-to-end Speech Assessment Models for the NOCASA 2025 Challenge

📅 2025-09-03

🤖 AI Summary

This study addresses automatic word-level pronunciation assessment for children learning Norwegian as a second language, in the context of the NOCASA 2025 Challenge. It compares three end-to-end models: an encoder-decoder Siamese architecture (E2E-R), a prefix-tuned direct classifier built on pretrained wav2vec 2.0 representations, and a model that integrates alignment-free goodness-of-pronunciation (GOP) features computed via Connectionist Temporal Classification (CTC), which avoids the forced alignment that conventional GOP computation depends on. The authors also introduce a weighted ordinal cross-entropy loss tailored to metrics such as unweighted average recall (UAR) and mean absolute error (MAE). Of the three models, the GOP-CTC-based one performed best, substantially surpassing the challenge baselines and attaining top leaderboard scores.
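To make the idea of an alignment-free GOP concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: the summary does not give their formula, so the log-ratio definition used in `gop_ctc`, the function names, and the toy inputs are all assumptions. The CTC forward recursion itself is the standard one: it sums the probability of every frame-level alignment that collapses to the target sequence, so no forced alignment is needed.

```python
import math

NEG_INF = float("-inf")

def logsumexp(*xs):
    """Numerically stable log(sum(exp(x))) over the arguments."""
    m = max(xs)
    if m == NEG_INF:
        return NEG_INF
    return m + math.log(sum(math.exp(x - m) for x in xs))

def ctc_log_likelihood(log_probs, target, blank=0):
    """log P(target | frames), summed over all CTC alignments.

    log_probs: list of T frames, each a list of per-symbol log-probabilities.
    target: list of symbol ids without blanks.
    """
    # Interleave blanks: [a, b] -> [blank, a, blank, b, blank]
    ext = [blank]
    for s in target:
        ext += [s, blank]
    size = len(ext)

    alpha = [NEG_INF] * size
    alpha[0] = log_probs[0][ext[0]]
    if size > 1:
        alpha[1] = log_probs[0][ext[1]]

    for t in range(1, len(log_probs)):
        new = [NEG_INF] * size
        for s in range(size):
            terms = [alpha[s]]
            if s >= 1:
                terms.append(alpha[s - 1])
            # Skipping a blank is allowed only between two different symbols.
            if s >= 2 and ext[s] != blank and ext[s] != ext[s - 2]:
                terms.append(alpha[s - 2])
            new[s] = logsumexp(*terms) + log_probs[t][ext[s]]
        alpha = new

    return logsumexp(alpha[-1], alpha[-2]) if size > 1 else alpha[-1]

def gop_ctc(log_probs, canonical, position, phone_ids, blank=0):
    """Hypothetical alignment-free GOP for one phone position (assumed form).

    Log-ratio of the canonical sequence's likelihood to the total likelihood
    of all sequences obtained by substituting the phone at `position`.
    Values are <= 0 as long as the canonical phone is in `phone_ids`.
    """
    numer = ctc_log_likelihood(log_probs, canonical, blank)
    alternatives = []
    for v in phone_ids:
        variant = list(canonical)
        variant[position] = v
        alternatives.append(ctc_log_likelihood(log_probs, variant, blank))
    return numer - logsumexp(*alternatives)

# Toy input: 2 frames, 3 symbols (0 = blank), uniform posteriors.
frames = [[math.log(1 / 3)] * 3 for _ in range(2)]
score = gop_ctc(frames, canonical=[1], position=0, phone_ids=[1, 2])
```

In a real system the frame posteriors would come from a CTC-trained acoustic model, and per-phone scores like these would be pooled into word-level features for the classifier.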

📝 Abstract

This paper presents an analysis of three end-to-end models developed for the NOCASA 2025 Challenge, aimed at automatic word-level pronunciation assessment for children learning Norwegian as a second language. Our models include an encoder-decoder Siamese architecture (E2E-R), a prefix-tuned direct classification model leveraging pretrained wav2vec2.0 representations, and a novel model integrating alignment-free goodness-of-pronunciation (GOP) features computed via CTC. We introduce a weighted ordinal cross-entropy loss tailored for optimizing metrics such as unweighted average recall and mean absolute error. Among the explored methods, our GOP-CTC-based model achieved the highest performance, substantially surpassing challenge baselines and attaining top leaderboard scores.
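The two metrics named in the abstract can be computed as follows. This is a small illustrative sketch: the function names and the example scores are made up, and the label values are only assumed to be ordinal integers. UAR is the mean of per-class recalls, so every class counts equally regardless of how many samples it has, while MAE measures how far predictions fall from the true score on the ordinal scale.

```python
def unweighted_average_recall(y_true, y_pred):
    """Mean of per-class recall, with each class weighted equally."""
    recalls = []
    for c in sorted(set(y_true)):
        idx = [i for i, y in enumerate(y_true) if y == c]
        hits = sum(1 for i in idx if y_pred[i] == c)
        recalls.append(hits / len(idx))
    return sum(recalls) / len(recalls)

def mean_absolute_error(y_true, y_pred):
    """Average absolute distance between true and predicted scores."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

# Made-up example: class 1 dominates, classes 2 and 3 are rare.
y_true = [1, 1, 1, 1, 2, 3]
y_pred = [1, 1, 1, 2, 2, 1]

uar = unweighted_average_recall(y_true, y_pred)  # (3/4 + 1/1 + 0/1) / 3
mae = mean_absolute_error(y_true, y_pred)        # (0+0+0+1+0+2) / 6 = 0.5
```

The example shows why the two metrics pull in different directions: plain accuracy here is 4/6, but UAR drops because the single class-3 sample is missed entirely, and MAE penalizes that miss more heavily because the prediction is two steps away.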

Problem

Research questions and friction points this paper is trying to address.

Assessing word-level pronunciation for children

Evaluating Norwegian second language learners

Comparing end-to-end speech assessment models

Innovation

Methods, ideas, or system contributions that make the work stand out.

End-to-end models for pronunciation assessment

Weighted ordinal cross-entropy loss optimization

CTC-based alignment-free GOP feature integration
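The page does not give the formula for the weighted ordinal cross-entropy loss, so the following is only one plausible formulation, written as a plain-Python sketch. The assumed form combines a per-class weight (which helps rare classes, and so UAR) with a penalty on probability mass placed on wrong classes that grows with ordinal distance from the target (which helps MAE). The function name, the distance normalization, and the example weights are all hypothetical.

```python
import math

def softmax(logits):
    """Convert raw scores to probabilities in a numerically stable way."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def weighted_ordinal_ce(logits, target, class_weights, eps=1e-12):
    """Assumed form of a weighted ordinal cross-entropy for one sample.

    loss = w[target] * ( -log p[target]
                         - sum_{k != target} d(k, target) * log(1 - p[k]) )
    where d is the ordinal distance scaled to [0, 1].
    """
    probs = softmax(logits)
    num_classes = len(probs)
    loss = -math.log(probs[target] + eps)
    for k, p in enumerate(probs):
        if k != target:
            dist = abs(k - target) / (num_classes - 1)
            loss -= dist * math.log(1.0 - p + eps)
    return class_weights[target] * loss

# Hypothetical class weights, e.g. inverse class frequency.
weights = [2.0, 1.0, 1.0, 1.0, 2.0]

# Same confidence in a wrong class, one step away vs. four steps away.
near_miss = weighted_ordinal_ce([0, 3, 0, 0, 0], target=0, class_weights=weights)
far_miss = weighted_ordinal_ce([0, 0, 0, 0, 3], target=0, class_weights=weights)
```

Under this formulation a confident prediction four steps from the target costs more than an equally confident prediction one step away, which ordinary cross-entropy would treat identically.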
