🤖 AI Summary
Current speech recognition models struggle to model pronunciation deviations, such as those caused by accents and dysfluencies, at the phoneme level, which limits the accuracy of automatic pronunciation assessment. To address this, we propose an end-to-end approach that combines multi-task learning with explicit phoneme similarity modeling, enabling fine-grained characterization of discrepancies between actual and canonical pronunciations. We construct and publicly release VCTK-accent, the first synthetic dataset specifically designed for pronunciation error modeling, and we introduce two novel metrics for quantifying pronunciation divergence. Experiments show that our method significantly improves phoneme-level transcription accuracy, particularly for non-native or atypical pronunciations, and is more robust than prior approaches. Our work establishes a new benchmark for phonetic error detection and advances the state of the art in automatic pronunciation assessment.
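The summary does not spell out the multi-task objective, so the sketch below shows one plausible formulation for intuition only: a CTC branch that transcribes the phonemes actually spoken, plus an auxiliary cross-entropy against targets smoothed by a phoneme-similarity matrix, so near-miss phonemes are penalized less than unrelated ones. The vocabulary size `V`, the similarity matrix `sim`, the mixing weight `alpha`, and the task weight `lam` are all illustrative assumptions, not the paper's definitions.

```python
import torch
import torch.nn.functional as F

V = 44       # assumed phoneme vocabulary size (e.g., ARPAbet plus CTC blank)
BLANK = 0    # assumed CTC blank index

def soft_targets(labels, sim, alpha=0.2):
    """Mix one-hot phoneme labels with rows of a row-normalized phoneme
    similarity matrix; the mixing scheme is illustrative, not the paper's."""
    hard = F.one_hot(labels, num_classes=sim.size(0)).float()
    return (1.0 - alpha) * hard + alpha * sim[labels]

def multitask_loss(ctc_log_probs, dec_logits, labels, in_lens, lab_lens,
                   sim, lam=0.5):
    """Hypothetical two-task objective: CTC loss for verbatim phoneme
    transcription + similarity-smoothed cross-entropy on decoder logits."""
    ctc = F.ctc_loss(ctc_log_probs, labels, in_lens, lab_lens, blank=BLANK)
    tgt = soft_targets(labels.reshape(-1), sim)              # (N*S, V)
    logp = F.log_softmax(dec_logits.reshape(-1, V), dim=-1)  # (N*S, V)
    aux = -(tgt * logp).sum(dim=-1).mean()
    return ctc + lam * aux

# Toy usage: T=50 encoder frames, N=2 utterances, S=12 target phonemes.
torch.manual_seed(0)
ctc_log_probs = F.log_softmax(torch.randn(50, 2, V), dim=-1)  # (T, N, V)
dec_logits = torch.randn(2, 12, V)                            # (N, S, V)
labels = torch.randint(1, V, (2, 12))                         # no blank labels
sim = F.normalize(torch.rand(V, V), p=1, dim=-1)              # stand-in matrix
loss = multitask_loss(ctc_log_probs, dec_logits, labels,
                      torch.full((2,), 50, dtype=torch.long),
                      torch.full((2,), 12, dtype=torch.long), sim)
print(loss.item())
```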
📝 Abstract
Phonetic error detection, a core subtask of automatic pronunciation assessment, identifies pronunciation deviations at the phoneme level. Speech variability from accents and dysfluencies challenges accurate phoneme recognition, with current models failing to capture these discrepancies effectively. We propose a verbatim phoneme recognition framework using multi-task training with novel phoneme similarity modeling that transcribes what speakers actually say rather than what they're supposed to say. We develop and open-source *VCTK-accent*, a simulated dataset containing phonetic errors, and propose two novel metrics for assessing pronunciation differences. Our work establishes a new benchmark for phonetic error detection.
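The abstract does not define the two proposed metrics, so as one plausible stand-in, the sketch below computes a similarity-weighted phoneme error rate between the verbatim (recognized) and canonical phoneme sequences: a Levenshtein distance whose substitution cost is discounted by phoneme similarity. The `sim` scores, edit costs, and normalization are assumptions for illustration, not the paper's metrics.

```python
def weighted_per(hyp, ref, sim, ins_cost=1.0, del_cost=1.0):
    """Illustrative similarity-weighted phoneme error rate: edit distance
    where substituting similar phonemes costs less, over reference length."""
    m, n = len(hyp), len(ref)
    d = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        d[i][0] = i * del_cost
    for j in range(1, n + 1):
        d[0][j] = j * ins_cost
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if hyp[i - 1] == ref[j - 1]:
                sub = d[i - 1][j - 1]  # exact match: free
            else:
                # near-miss substitutions (e.g., /t/ for /d/) cost less
                sub = d[i - 1][j - 1] + 1.0 - sim.get((hyp[i - 1], ref[j - 1]), 0.0)
            d[i][j] = min(sub, d[i - 1][j] + del_cost, d[i][j - 1] + ins_cost)
    return d[m][n] / max(n, 1)

# Toy example: a speaker devoices the final /d/ of "word" to /t/.
sim = {("t", "d"): 0.8, ("d", "t"): 0.8}  # hypothetical similarity scores
canonical = ["w", "er", "d"]
spoken = ["w", "er", "t"]
print(weighted_per(spoken, canonical, sim))  # 0.2 / 3 ≈ 0.067
```

A plain phoneme error rate would charge this accent-driven devoicing the same as an arbitrary substitution; weighting by similarity keeps the metric sensitive to genuine errors while staying tolerant of close phonetic neighbors.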