🤖 AI Summary
This work addresses the challenge of automatic speech recognition (ASR) for low-resource languages such as Taiwanese Hokkien, where performance is severely constrained by the scarcity of labeled speech data—particularly in scenarios involving videos with subtitles only in a related auxiliary language like Mandarin. To tackle this, the authors propose a translation-guided end-to-end ASR framework that adaptively integrates semantic information from the auxiliary language through multilingual translation embeddings and a novel Parallel Gated Cross-Attention (PGCA) mechanism. This approach provides strong semantic guidance while mitigating cross-lingual interference. Combined with translation-guided learning and multilingual embedding alignment strategies, the method achieves a relative character error rate reduction of 14.77% on the authors’ newly curated 30-hour Taiwanese Hokkien drama corpus, YT-THDC, substantially advancing ASR performance under low-resource conditions.
📝 Abstract
Low-resource automatic speech recognition (ASR) continues to pose significant challenges, primarily due to the limited availability of transcribed data for many languages. Although a wealth of spoken content is available in television dramas and online videos, Taiwanese Hokkien exemplifies this issue: transcriptions are scarce, and most available subtitles are provided only in Mandarin. To address this gap, we introduce TG-ASR, a translation-guided ASR framework for Taiwanese Hokkien drama speech recognition that leverages multilingual translation embeddings to improve recognition performance in low-resource settings. The framework is centered on a parallel gated cross-attention (PGCA) mechanism, which adaptively integrates embeddings from different auxiliary languages into the ASR decoder. This mechanism provides robust cross-lingual semantic guidance while ensuring stable optimization and minimizing interference between languages. To support ongoing research, we also present YT-THDC, a 30-hour corpus of Taiwanese Hokkien drama speech with aligned Mandarin subtitles and manually verified Taiwanese Hokkien transcriptions. Comprehensive experiments and analyses identify the auxiliary languages that most effectively enhance ASR performance, achieving a 14.77% relative reduction in character error rate and demonstrating the efficacy of translation-guided learning for underrepresented languages in practical applications.
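To make the idea of gated integration of auxiliary-language embeddings concrete, the following is a minimal NumPy sketch of what a parallel gated cross-attention step *might* look like. All names, shapes, and the sigmoid gating form are assumptions for illustration; the paper's exact PGCA formulation may differ.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(query, memory):
    # Scaled dot-product attention: query (T, d) attends over memory (S, d).
    d = query.shape[-1]
    scores = query @ memory.T / np.sqrt(d)
    return softmax(scores, axis=-1) @ memory  # (T, d) context vectors

def pgca(decoder_states, aux_embeddings, gate_weights):
    """Hypothetical parallel gated cross-attention.

    decoder_states: (T, d) ASR decoder hidden states.
    aux_embeddings: dict mapping auxiliary language -> (S_l, d) translation embeddings.
    gate_weights:   dict mapping auxiliary language -> (d,) gate projection vector.
    Each auxiliary language contributes a context vector, scaled per time step
    by a learned sigmoid gate, and added residually to the decoder states.
    """
    out = decoder_states.copy()
    for lang, memory in aux_embeddings.items():
        ctx = cross_attention(decoder_states, memory)               # (T, d)
        gate = 1.0 / (1.0 + np.exp(-(decoder_states @ gate_weights[lang])))  # (T,)
        out = out + gate[:, None] * ctx                             # gated residual add
    return out
```

Because each auxiliary language has its own attention branch and gate, the decoder can down-weight a language whose embeddings conflict with the acoustics, which is one plausible way the mechanism could mitigate cross-lingual interference.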