Non-native Children's Automatic Speech Assessment Challenge (NOCASA)

📅 2025-04-29
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study addresses the dual challenges of scarce training data and severe class imbalance in automatic pronunciation assessment for non-native children. To tackle these issues, we propose an end-to-end speech scoring method tailored to few-shot and imbalanced-class scenarios. We introduce TeflonNorL2—the first pseudo-anonymized Norwegian L2 children’s pronunciation dataset—and design a multitask learning framework based on wav2vec 2.0 that jointly optimizes phoneme accuracy prediction and fine-grained proficiency-level classification. To mitigate class imbalance, we incorporate dedicated imbalance-aware learning strategies and acoustic feature augmentation. Experimental results demonstrate that our model achieves a 36.37% unweighted average recall (UAR) on the test set, significantly outperforming the ComParE_16+SVM baseline. The proposed approach delivers a deployable, robust, low-resource solution for gamified second-language oral training systems.

📝 Abstract
This paper presents the "Non-native Children's Automatic Speech Assessment" (NOCASA) challenge, a data competition that is part of the IEEE MLSP 2025 conference. NOCASA challenges participants to develop new systems that can assess single-word pronunciations of young second language (L2) learners as part of a gamified pronunciation training app. To achieve this, several issues must be addressed, most notably the limited amount of available training data and the highly unbalanced distribution among the pronunciation level categories. To expedite development, we provide pseudo-anonymized training data (TeflonNorL2) containing 10,334 recordings from 44 speakers attempting to pronounce 205 distinct Norwegian words, human-rated on a 1 to 5 scale (the number of stars that should be given in the game). In addition to the data, two pre-trained systems are released as official baselines: an SVM classifier trained on the ComParE_16 acoustic feature set and a multi-task wav2vec 2.0 model. The latter achieves the best performance on the challenge test set, with an unweighted average recall (UAR) of 36.37%.
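Because the abstract reports results as unweighted average recall, a small sketch of that metric may help. UAR is the mean of the per-class recalls, so each star level counts equally however many recordings it has. The function name and the toy ratings below are illustrative assumptions, not challenge data or the organizers' scoring script.

```python
from collections import defaultdict


def unweighted_average_recall(y_true, y_pred):
    """Mean of per-class recalls over the classes present in y_true."""
    if not y_true:
        raise ValueError("y_true must not be empty")
    total = defaultdict(int)    # recordings per true class
    correct = defaultdict(int)  # correctly predicted recordings per true class
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        if t == p:
            correct[t] += 1
    recalls = [correct[c] / total[c] for c in total]
    return sum(recalls) / len(recalls)


# Hypothetical toy ratings on the 1-5 star scale (not challenge data).
# The predictor favors the majority class 5.
y_true = [5, 5, 5, 5, 5, 5, 1, 1, 3, 3]
y_pred = [5, 5, 5, 5, 5, 5, 5, 1, 5, 5]

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
uar = unweighted_average_recall(y_true, y_pred)
print(f"accuracy={accuracy:.2f}, UAR={uar:.2f}")
```

On this toy example, plain accuracy rewards the majority-class bias, while UAR penalizes the missed minority classes. That is why UAR is a common choice when the label distribution is highly unbalanced.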
Problem

Research questions and friction points this paper is trying to address.

Assess non-native children's single-word pronunciation accuracy
Address limited and unbalanced training data issues
Develop gamified L2 speech assessment systems
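One common way to counter an unbalanced label distribution is to weight each class inversely to its frequency in the training loss. The sketch below uses the usual "balanced" heuristic, weight = N / (K * n_c), where N is the number of samples, K the number of classes, and n_c the count of class c. This is an illustrative assumption: the source does not say that the baselines or participants use this scheme, and the label counts are made up.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Inverse-frequency weights: N / (K * n_c) for each class c."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {c: n_samples / (n_classes * counts[c]) for c in counts}


# Hypothetical star ratings with a dominant class 5 (not challenge data).
labels = [1] * 2 + [3] * 2 + [5] * 6
weights = balanced_class_weights(labels)
for star in sorted(weights):
    print(f"class {star}: weight {weights[star]:.3f}")
```

With these weights, every class contributes the same total weight to the loss, so rare star levels are not drowned out by the frequent ones.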
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses SVM classifier with ComParE_16 features
Employs multi-task wav2vec 2.0 model
Provides pseudo-anonymized training data TeflonNorL2
Authors

Yaroslav Getman, Aalto University (ASR)
Tamás Grósz, Department of Information and Communications Engineering, Aalto University, Finland
Mikko Kurimo, Professor in Speech and Language Processing, Aalto University, Finland (speech recognition, machine learning, language modeling)
G. Salvi, Department of Electronic Systems, NTNU, Norway