🤖 AI Summary
This work addresses the performance degradation of automatic speech recognition (ASR) systems on non-native accented speech by proposing a modeling approach grounded in the Interlanguage Speech Intelligibility Benefit (ISIB) hypothesis. The method integrates multilingual multitask learning—leveraging both the speaker’s native language (L1) and the target second language (L2)—with differentiable K-means clustering in a self-supervised speech representation space to produce accent-robust discrete phoneme-like tokens for ASR training. Notably, this is the first approach to jointly optimize differentiable clustering and L1–L2 multitask learning in an end-to-end framework. The model demonstrates strong generalization: it outperforms baseline systems using only native-language data and achieves approximately a 20% relative improvement in recognition accuracy when supplemented with a small amount of accented speech data.
📝 Abstract
Building ASR systems robust to foreign-accented speech is an important challenge in today's globalized world. A prior study explored how to enhance the performance of phonetic token-based ASR on accented speech by reproducing the phenomenon known as the interlanguage speech intelligibility benefit (ISIB), in which foreign-accented speech is more intelligible to listeners who share the speaker's native language than to native listeners of the target language. ISIB was technically implemented by using the speaker's L1 to learn k-means cluster centroids in an SSL feature space, from which phonetic tokens were obtained. In this study, we propose a more advanced modeling of ISIB. By employing differentiable k-means and optimizing the entire module for both L1 and L2 ASR, the proposed method outperformed the baselines, both when using only native speech and when additionally incorporating a limited amount of accented speech. Notably, in the latter scenario, our method achieved approximately a 20% relative improvement in recognition accuracy.
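The abstract describes differentiable k-means only at a high level. One common way to make k-means assignment differentiable is to replace the hard nearest-centroid choice with a temperature-controlled softmax over negative squared distances, so gradients can flow back to both the centroids and the upstream SSL encoder. A minimal NumPy sketch under that assumption (the function name, the temperature `tau`, and the soft-quantization step are illustrative, not taken from the paper):

```python
import numpy as np

def soft_kmeans_assign(features, centroids, tau=0.1):
    """Soft (differentiable) cluster assignments.

    features:  (T, D) frame-level SSL representations
    centroids: (K, D) learnable cluster centroids
    tau:       softmax temperature; tau -> 0 recovers hard k-means
    Returns (T, K) assignment probabilities.
    """
    # Squared Euclidean distance from every frame to every centroid: (T, K)
    d2 = ((features[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
    logits = -d2 / tau
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    probs = np.exp(logits)
    probs /= probs.sum(axis=1, keepdims=True)
    return probs

rng = np.random.default_rng(0)
feats = rng.normal(size=(5, 4))   # 5 frames, 4-dim features
cents = rng.normal(size=(3, 4))   # 3 clusters

probs = soft_kmeans_assign(feats, cents)
tokens = probs.argmax(axis=1)     # discrete phoneme-like token IDs
quantized = probs @ cents         # soft-quantized features; differentiable
```

In an end-to-end setup such as the one the paper proposes, the assignment probabilities (or the soft-quantized features) would feed the downstream L1 and L2 ASR losses, letting the shared clustering module be optimized jointly for both tasks.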