🤖 AI Summary
This work addresses the challenge of limited automatic speech recognition (ASR) performance in low-resource languages caused by scarce training data. It introduces task arithmetic to cross-lingual ASR for the first time: task vectors are generated by fine-tuning the Whisper model on individual languages, and task vectors from high-resource languages are then linearly combined to improve ASR on low-resource target languages. The proposed approach consistently reduces word error rates across multiple low-resource languages, demonstrating effective cross-lingual transfer via task-vector fusion, and establishes an efficient, scalable paradigm for improving ASR in low-resource settings.
📝 Abstract
The development of resource-constrained approaches to automatic speech recognition (ASR) is of great interest due to its broad applicability to many low-resource languages for which there is scant usable data. Existing approaches to many low-resource natural language processing tasks leverage additional data from higher-resource languages that are closely related to a target low-resource language. One increasingly popular approach uses task arithmetic to combine models trained on different tasks to create a model for a task with little to no training data. In this paper, we consider training on a particular language to be a task, and we generate task vectors by fine-tuning variants of the Whisper ASR system. For pairings of high- and low-resource languages, we merge task vectors via a linear combination, optimizing the combination weights against the downstream word error rate on the low-resource target language's validation set. We find that this approach consistently improves performance on the target languages.
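The core operation the abstract describes can be sketched in a few lines: a task vector is the element-wise difference between fine-tuned and pretrained weights, and merging adds a weighted sum of such vectors back onto the base model. The sketch below is a minimal illustration with toy parameter dictionaries, not the paper's implementation; the function and variable names (`task_vector`, `merge`, `ft_lang_a`, etc.) are hypothetical, and in practice the weights would be searched over by evaluating validation WER on the target language.

```python
import numpy as np

def task_vector(base_params, finetuned_params):
    # Task vector = fine-tuned weights minus pretrained base weights.
    return {k: finetuned_params[k] - base_params[k] for k in base_params}

def merge(base_params, task_vectors, weights):
    # Add a weighted linear combination of task vectors onto the base model.
    return {
        k: base + sum(w * tv[k] for w, tv in zip(weights, task_vectors))
        for k, base in base_params.items()
    }

# Toy example with a single "layer" of parameters.
base = {"w": np.zeros(3)}
ft_lang_a = {"w": np.array([1.0, 0.0, 2.0])}  # fine-tuned on high-resource language A
ft_lang_b = {"w": np.array([0.0, 4.0, 0.0])}  # fine-tuned on high-resource language B

tvs = [task_vector(base, ft_lang_a), task_vector(base, ft_lang_b)]
merged = merge(base, tvs, weights=[0.5, 0.25])
# merged["w"] == [0.5, 1.0, 1.0]
```

In the paper's setting the dictionaries would hold Whisper's parameter tensors, and the combination weights would be tuned by measuring word error rate on the low-resource language's validation set.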