Goodness-of-pronunciation without phoneme time alignment

📅 2026-03-26

🤖 AI Summary

This work addresses pronunciation quality assessment (PQA) in low-resource languages, where limited ASR training data means that phoneme time boundaries and posteriors are typically unavailable, which makes conventional feature extraction impractical. The authors propose a PQA framework that operates without phoneme time alignment by using open-source, weakly-supervised multilingual automatic speech recognition (ASR) models to generate hypotheses. These hypotheses are mapped to a phoneme confusion network to derive phoneme posterior probabilities, which are then combined with word-level speaking rate and duration features. A cross-attention architecture combines the phoneme-level and frame-level features, so no phoneme time alignment is needed. To the best of the authors’ knowledge, this is the first approach to achieve ASR-based PQA under strictly alignment-free conditions. Experiments on the English speechocean762 benchmark and a low-resource Tamil dataset show performance comparable to standard frame-synchronous features, which eases the expansion of speech evaluation to low-resource languages.
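To make the confusion-network step concrete, here is a minimal sketch of one way to turn scored ASR hypotheses into per-phoneme posteriors. It is not the authors' implementation: the toy `LEXICON`, the helper names `align` and `phoneme_posteriors`, the use of an N-best list, and the edit-distance alignment to the canonical phoneme sequence are all assumptions made for illustration.

```python
import math

# Hypothetical toy lexicon (word -> phonemes); an assumption for illustration only.
LEXICON = {
    "cat": ["k", "ae", "t"],
    "cut": ["k", "ah", "t"],
    "cap": ["k", "ae", "p"],
}

def align(ref, hyp):
    """Edit-distance align hyp to ref; return the hyp phoneme per ref slot (None = deleted)."""
    n, m = len(ref), len(hyp)
    cost = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = i
    for j in range(1, m + 1):
        cost[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = cost[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            cost[i][j] = min(sub, cost[i - 1][j] + 1, cost[i][j - 1] + 1)
    # Backtrace from the bottom-right corner to recover the alignment.
    slots = [None] * n
    i, j = n, m
    while i > 0 and j > 0:
        if cost[i][j] == cost[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1]):
            slots[i - 1] = hyp[j - 1]  # match or substitution
            i, j = i - 1, j - 1
        elif cost[i][j] == cost[i - 1][j] + 1:
            i -= 1  # deletion: this canonical slot stays None
        else:
            j -= 1  # insertion: extra hypothesis phoneme is skipped
    return slots

def phoneme_posteriors(canonical, nbest):
    """nbest is a list of (words, log_score); returns one posterior dict per canonical slot."""
    # A softmax over hypothesis log-scores gives hypothesis-level posteriors.
    top = max(score for _, score in nbest)
    weights = [math.exp(score - top) for _, score in nbest]
    total = sum(weights)
    bins = [dict() for _ in canonical]
    for (words, _), weight in zip(nbest, weights):
        phones = [p for word in words for p in LEXICON[word]]
        for slot, phone in zip(bins, align(canonical, phones)):
            key = phone if phone is not None else "<eps>"
            slot[key] = slot.get(key, 0.0) + weight / total
    return bins

# Usage: the posterior of the canonical phoneme in each slot acts as a GOP-like score.
canonical = LEXICON["cat"]
nbest = [(["cat"], math.log(0.6)), (["cut"], math.log(0.3)), (["cap"], math.log(0.1))]
bins = phoneme_posteriors(canonical, nbest)
scores = [slot.get(p, 0.0) for slot, p in zip(bins, canonical)]
```

In this sketch no time boundaries are used at any point: the score for each canonical phoneme depends only on how much hypothesis probability mass agrees with it.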

📝 Abstract

In speech evaluation, an Automatic Speech Recognition (ASR) model often computes time boundaries and phoneme posteriors for input features. However, limited data for ASR training hinders expansion of speech evaluation to low-resource languages. Open-source weakly-supervised models are capable of ASR over many languages, but they are frame-asynchronous and not phonemic, hindering feature extraction for speech evaluation. This paper proposes to overcome incompatibilities for feature extraction with weakly-supervised models, easing expansion of speech evaluation to low-resource languages. Phoneme posteriors are computed by mapping ASR hypotheses to a phoneme confusion network. Word-level, rather than phoneme-level, speaking rate and duration are used. Phoneme and frame-level features are combined using a cross-attention architecture, obviating phoneme time alignment. This approach performs comparably with standard frame-synchronous features on English speechocean762 and low-resource Tamil datasets.
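The cross-attention idea can be sketched in a few lines: each phoneme-level feature vector acts as a query over all frame-level vectors, so the model learns a soft association between phonemes and frames instead of needing hard time boundaries. The sketch below is an assumption-laden illustration, not the paper's architecture: it uses a single head with plain scaled dot-product attention and no learned projections, and the names `softmax` and `cross_attention` are hypothetical.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    top = max(xs)
    exps = [math.exp(x - top) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(phoneme_feats, frame_feats):
    """Each phoneme vector (query) attends over all frame vectors (keys and values)."""
    dim = len(frame_feats[0])
    outputs, attention = [], []
    for query in phoneme_feats:
        # Scaled dot-product score between this phoneme and every frame.
        scores = [
            sum(q * k for q, k in zip(query, frame)) / math.sqrt(dim)
            for frame in frame_feats
        ]
        weights = softmax(scores)
        # Weighted sum of frame vectors: a soft, alignment-free frame summary.
        context = [
            sum(w * frame[d] for w, frame in zip(weights, frame_feats))
            for d in range(dim)
        ]
        outputs.append(context)
        attention.append(weights)
    return outputs, attention

# Usage with toy 2-dimensional features: 2 phonemes attending over 4 frames.
phonemes = [[10.0, 0.0], [0.0, 10.0]]
frames = [[1.0, 0.0], [0.9, 0.1], [0.1, 0.9], [0.0, 1.0]]
contexts, attn = cross_attention(phonemes, frames)
```

Each row of `attn` is a distribution over frames, which plays the role that explicit phoneme time boundaries would play in a frame-synchronous pipeline. A trained model would typically add learned query, key, and value projections and multiple heads.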

Problem

Research questions and friction points this paper is trying to address.

speech evaluation

low-resource languages

phoneme time alignment

weakly-supervised ASR

feature extraction

Innovation

Methods, ideas, or system contributions that make the work stand out.

goodness-of-pronunciation

weakly-supervised ASR

phoneme confusion network