🤖 AI Summary
This study addresses the bottleneck in mispronunciation detection and diagnosis (MDD) that arises from heavy reliance on large-scale annotated data and model training. We propose a novel retrieval-based approach that requires no model training whatsoever. Our method leverages pre-trained automatic speech recognition (ASR) models to extract utterance-level speech representations and performs cross-utterance phoneme-segment similarity retrieval to localize mispronunciations and deliver phoneme-level diagnostic feedback, bypassing phoneme modeling, fine-tuning, and task-specific training entirely. To our knowledge, this is the first work to introduce the retrieval paradigm into MDD, substantially lowering deployment barriers and enhancing generalizability across languages and speakers. Evaluated on the L2-ARCTIC benchmark, our method achieves a 69.60% F1 score, significantly outperforming all training-free baselines and demonstrating both effectiveness and practical utility.
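The retrieval step described above can be sketched as follows. This is a minimal illustration, not the paper's actual pipeline: the function names, the mean-pooling of frame features, the cosine-similarity measure, and the fixed decision threshold are all assumptions for demonstration, and real ASR representations would come from a pretrained model rather than raw arrays.

```python
import numpy as np


def segment_embedding(frames: np.ndarray) -> np.ndarray:
    """Mean-pool frame-level ASR features into one phoneme-segment vector.

    (Pooling choice is an illustrative assumption, not from the paper.)
    """
    return frames.mean(axis=0)


def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two segment embeddings."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))


def detect_mispronunciations(test_segments, reference_bank, threshold=0.7):
    """For each (phoneme, frame_features) segment of a test utterance,
    retrieve the most similar reference segment of the same canonical
    phoneme and flag a mispronunciation when similarity falls below a
    threshold (threshold value is a hypothetical choice)."""
    results = []
    for phoneme, feats in test_segments:
        query = segment_embedding(feats)
        refs = reference_bank.get(phoneme, [])
        best = max((cosine(query, r) for r in refs), default=0.0)
        results.append((phoneme, best, best < threshold))
    return results
```

A well-pronounced segment retrieves a close reference match and passes, while a deviant segment finds no similar reference and is flagged; no model parameters are ever trained, which is the core appeal of the retrieval paradigm here.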
📝 Abstract
Mispronunciation Detection and Diagnosis (MDD) is crucial for language learning and speech therapy. Unlike conventional methods that require training scoring models or phoneme-level models, we propose a novel training-free framework that applies retrieval techniques to a pretrained Automatic Speech Recognition model. Our method avoids phoneme-specific modeling and additional task-specific training, while still achieving accurate detection and diagnosis of pronunciation errors. Experiments on the L2-ARCTIC dataset show that our method achieves a superior F1 score of 69.60% while avoiding the complexity of model training.