🤖 AI Summary
This work addresses two limitations of existing ASR-based mispronunciation detection methods: the coarse-grained nature of CTC alignment makes it hard to capture transient pronunciation deviations, and explicitly incorporating canonical pronunciation priors biases predictions toward the intended targets. To overcome these issues, the authors propose a prompt-free framework that decouples acoustic modeling from canonical pronunciation guidance. Specifically, they introduce the CROTTC model, which enforces monotonic, frame-level alignment for precise deviation localization, and integrate an implicit feedback (IF) strategy to inject mispronunciation knowledge without inducing linguistic bias. Evaluated on the L2-ARCTIC and Iqra'Eval2 datasets, the proposed approach achieves F1-scores of 71.77% and 71.70%, respectively, which the authors present as evidence that decoupling acoustics from explicit priors makes mispronunciation detection more robust.
📝 Abstract
Mispronunciation Detection and Diagnosis (MDD) requires modeling fine-grained acoustic deviations, yet current ASR-derived MDD systems face two inherent limitations. CTC-based models favor sequence-level alignments that neglect transient mispronunciation cues, while explicit canonical priors bias predictions toward intended targets. To address these bottlenecks, we propose a prompt-free framework that decouples acoustic fidelity from canonical guidance. First, we introduce CROTTC, an acoustic model that enforces monotonic, frame-level alignment to accurately capture pronunciation deviations. Second, we implicitly inject mispronunciation information via the IF strategy, following the knowledge transfer principle. Experiments show that CROTTC-IF achieves F1-scores of 71.77% on L2-ARCTIC and 71.70% on the Iqra'Eval2 leaderboard. Empirical analysis further shows that decoupling acoustics from explicit priors yields robust MDD.
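The abstract reports F1-scores without spelling out how they are computed. As a rough illustration only, the sketch below follows a convention commonly used in MDD evaluation, where a "true rejection" is a mispronounced phoneme that the system flags, a "false rejection" is a correctly pronounced phoneme that is flagged, and a "false acceptance" is a mispronounced phoneme that goes unflagged. The function name `mdd_f1`, the assumption that the three phoneme sequences are already aligned position by position, and the toy phoneme labels are all hypothetical; the paper's own scoring pipeline may differ.

```python
from typing import Sequence, Tuple


def mdd_f1(
    canonical: Sequence[str],
    annotated: Sequence[str],
    predicted: Sequence[str],
) -> Tuple[float, float, float]:
    """Return (precision, recall, F1) for mispronunciation detection.

    Assumes the three phoneme sequences are already aligned, so that
    position i refers to the same phoneme slot in each (an assumption
    of this sketch, not something stated in the source).
    """
    if not (len(canonical) == len(annotated) == len(predicted)):
        raise ValueError("sequences must be aligned to the same length")

    true_rej = false_rej = false_acc = 0
    for canon, actual, pred in zip(canonical, annotated, predicted):
        is_mispronounced = actual != canon   # ground truth from the annotation
        is_flagged = pred != canon           # system output deviates from canonical
        if is_mispronounced and is_flagged:
            true_rej += 1                    # mispronunciation detected
        elif is_flagged:
            false_rej += 1                   # correct phoneme wrongly flagged
        elif is_mispronounced:
            false_acc += 1                   # mispronunciation missed

    flagged = true_rej + false_rej
    mispronounced = true_rej + false_acc
    precision = true_rej / flagged if flagged else 0.0
    recall = true_rej / mispronounced if mispronounced else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Toy example with made-up phoneme labels.
canonical = ["DH", "AH", "K", "AE", "T"]
annotated = ["D", "AH", "K", "EH", "T"]   # speaker mispronounced slots 0 and 3
predicted = ["D", "AH", "G", "AE", "T"]   # system flags slots 0 and 2
precision, recall, f1 = mdd_f1(canonical, annotated, predicted)
print(precision, recall, f1)
```

In this toy case the system catches one of the two mispronunciations and raises one false alarm, so precision, recall, and F1 all come out to 0.5. This detection-only view ignores whether the predicted phoneme matches the annotated one, which is the separate diagnosis part of MDD.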