🤖 AI Summary
This study addresses a limitation of current computer-assisted pronunciation training (CAPT) systems: their feedback is often neither intuitive nor actionable for second-language learners. To overcome this, the authors propose the first application of instruction-tuned audio-language models (ALMs) to interactive, chat-based pronunciation training. They introduce L2-Arctic-plus, a new English dataset that pairs fine-grained explanations of pronunciation errors with targeted suggestions for improvement. The work also compares cascaded ASR + large language model pipelines against end-to-end ALMs on this dataset. Experimental results show that the instruction-tuned models significantly outperform existing baselines in both mispronunciation detection and feedback generation, as measured by objective metrics and human evaluation. These results point toward CAPT systems that give learners more natural, interpretable, and actionable guidance.
📝 Abstract
Achieving pronunciation proficiency in a second language (L2) remains a challenge, despite the development of Computer-Assisted Pronunciation Training (CAPT) systems. Traditional CAPT systems often provide unintuitive feedback that lacks actionable guidance, limiting their effectiveness. Recent advancements in audio-language models (ALMs) offer the potential to enhance these systems by providing more user-friendly feedback. In this work, we investigate ALMs for chat-based pronunciation training by introducing L2-Arctic-plus, an English dataset with detailed error explanations and actionable suggestions for improvement. We benchmark cascaded ASR+LLM pipelines and existing ALMs on this dataset, specifically on detecting mispronunciation and generating actionable feedback. To improve performance, we further propose to instruction-tune ALMs on L2-Arctic-plus. Experimental results demonstrate that our instruction-tuned models significantly outperform existing baselines on mispronunciation detection and suggestion generation in both objective and human evaluation, highlighting the value of the proposed dataset.