Goodness-of-pronunciation without phoneme time alignment

📅 2026-03-26
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
This work addresses the challenge of pronunciation quality assessment (PQA) in low-resource languages, where phoneme-level time-aligned annotations are typically unavailable, rendering conventional methods inapplicable. The authors propose a weakly supervised PQA framework that operates without phoneme-level alignments by leveraging multilingual automatic speech recognition (ASR) to generate word-level hypotheses. From these, a phoneme confusion network is constructed to derive phoneme posterior probabilities, which are then integrated with word-level speaking rate and duration features. A cross-attention mechanism jointly models frame-level and phoneme-level information within this framework. To the best of the authors' knowledge, this is the first approach to achieve ASR-based PQA under strictly alignment-free conditions. Experiments on the English Speechocean762 benchmark and a low-resource Tamil dataset demonstrate performance comparable to state-of-the-art methods that rely on frame-synchronized phonetic annotations, significantly enhancing applicability in resource-constrained settings.
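
As a rough illustration of how phoneme posteriors might be read off a set of ASR hypotheses, the sketch below aligns each hypothesis to the canonical phoneme sequence with edit distance and accumulates normalized hypothesis weights per canonical slot. This is a simplified stand-in, not the authors' confusion-network construction: the names `align` and `phoneme_posteriors`, the `<del>` label, and the assumption that hypotheses arrive already converted to phonemes with log scores are all hypothetical.

```python
import math
from collections import defaultdict


def align(ref, hyp):
    """Levenshtein-align hyp to ref.

    Returns one entry per ref phoneme: the hyp phoneme aligned to it,
    or None if the ref phoneme was deleted in the hypothesis.
    """
    n, m = len(ref), len(hyp)
    cost = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = i
    for j in range(1, m + 1):
        cost[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = cost[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            cost[i][j] = min(sub, cost[i - 1][j] + 1, cost[i][j - 1] + 1)

    # Backtrace from the bottom-right corner of the cost matrix.
    aligned = [None] * n
    i, j = n, m
    while i > 0 and j > 0:
        if cost[i][j] == cost[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1]):
            aligned[i - 1] = hyp[j - 1]  # match or substitution
            i -= 1
            j -= 1
        elif cost[i][j] == cost[i - 1][j] + 1:
            i -= 1  # deletion: ref phoneme has no counterpart
        else:
            j -= 1  # insertion: extra hyp phoneme is skipped
    return aligned


def phoneme_posteriors(canonical, nbest):
    """Accumulate hypothesis weights into one bin per canonical phoneme.

    nbest is a list of (phoneme_list, log_score) pairs (assumed format).
    Returns a list of dicts mapping phoneme label -> posterior mass.
    """
    max_log = max(score for _, score in nbest)
    weights = [math.exp(score - max_log) for _, score in nbest]
    total = sum(weights)
    bins = [defaultdict(float) for _ in canonical]
    for (phones, _), w in zip(nbest, weights):
        for slot, ph in zip(bins, align(canonical, phones)):
            slot[ph if ph is not None else "<del>"] += w / total
    return [dict(b) for b in bins]


if __name__ == "__main__":
    canonical = ["K", "AE", "T"]
    nbest = [
        (["K", "AE", "T"], math.log(0.6)),
        (["K", "AH", "T"], math.log(0.3)),
        (["K", "AE"], math.log(0.1)),
    ]
    for target, slot in zip(canonical, phoneme_posteriors(canonical, nbest)):
        print(target, slot)
```

In this toy input the second slot splits its mass between `AE` (about 0.7) and `AH` (about 0.3), so the posterior of the canonical phoneme can serve as a pronunciation score without any time boundaries.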

📝 Abstract
In speech evaluation, an Automatic Speech Recognition (ASR) model often computes time boundaries and phoneme posteriors for input features. However, limited data for ASR training hinders expansion of speech evaluation to low-resource languages. Open-source weakly-supervised models are capable of ASR over many languages, but they are frame-asynchronous and not phonemic, hindering feature extraction for speech evaluation. This paper proposes to overcome incompatibilities for feature extraction with weakly-supervised models, easing expansion of speech evaluation to low-resource languages. Phoneme posteriors are computed by mapping ASR hypotheses to a phoneme confusion network. Speaking rate and duration are computed at the word level rather than the phoneme level. Phoneme-level and frame-level features are combined using a cross-attention architecture, obviating phoneme time alignment. The proposed features perform comparably with standard frame-synchronous features on English speechocean762 and low-resource Tamil datasets.
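
To make the alignment-free combination concrete, here is a minimal sketch of scaled dot-product cross-attention in which each phoneme-level vector acts as a query over all frame-level vectors. It assumes both feature sets share one dimensionality and omits the learned query, key, and value projections a trained model would have; the function names `softmax` and `cross_attention` are illustrative and not taken from the paper.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]


def cross_attention(phone_feats, frame_feats):
    """Let every phoneme vector attend over every frame vector.

    phone_feats: list of P vectors (queries), each of length d.
    frame_feats: list of T vectors (used as both keys and values), length d.
    Returns (outputs, weights): P context vectors and a P x T weight matrix.
    No phoneme boundaries are needed; the soft weights play that role.
    """
    d = len(frame_feats[0])
    scale = 1.0 / math.sqrt(d)
    outputs, weights = [], []
    for q in phone_feats:
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in frame_feats]
        attn = softmax(scores)
        context = [sum(a * v[c] for a, v in zip(attn, frame_feats)) for c in range(d)]
        outputs.append(context)
        weights.append(attn)
    return outputs, weights


if __name__ == "__main__":
    phones = [[1.0, 0.0], [0.0, 1.0]]              # two phoneme-level vectors
    frames = [[4.0, 0.0], [4.0, 0.0], [0.0, 4.0]]  # three frame-level vectors
    out, attn = cross_attention(phones, frames)
    for row in attn:
        print([round(a, 3) for a in row])
```

Each row of the weight matrix sums to one, so every phoneme receives a soft summary of the frames most similar to it instead of a hard time segment.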
Problem

Research questions and friction points this paper is trying to address.

speech evaluation
low-resource languages
phoneme time alignment
weakly-supervised ASR
feature extraction
Innovation

Methods, ideas, or system contributions that make the work stand out.

goodness-of-pronunciation
weakly-supervised ASR
phoneme confusion network
cross-attention
low-resource languages