AI Summary
This work addresses the poor cross-lingual transfer performance of multilingual models on low-resource languages by proposing XITE, an embedding-based data augmentation method. XITE first matches unlabeled text in a low-resource target language with labeled examples from a high-resource language (e.g., English) based on embedding similarity, and the target text adopts the matched example's label. It then generates synthetic training data for fine-tuning via source-target embedding interpolation, and incorporates linear discriminant analysis (LDA) to project target-language representations into a language-rich subspace before interpolation. XITE is the first approach to jointly leverage embedding interpolation and LDA for cross-lingual data augmentation, mitigating catastrophic forgetting while improving transfer performance. On sentiment analysis and natural language inference, it achieves gains of up to 35.91% and 81.16%, respectively, using XLM-R, for target languages including Korean, Arabic, Urdu, and Hindi, without compromising performance on the high-resource language.
Abstract
Facilitating cross-lingual transfer in multilingual language models remains a critical challenge. Towards this goal, we propose an embedding-based data augmentation technique called XITE. We start with unlabeled text from a low-resource target language, identify an English counterpart in a task-specific training corpus using embedding-based similarities, and adopt its label. Next, we perform a simple interpolation of the source and target embeddings to create synthetic data for task-specific fine-tuning. Projecting the target text into a language-rich subspace using linear discriminant analysis (LDA), prior to interpolation, further boosts performance. Our cross-lingual embedding-based augmentation technique XITE yields significant improvements of up to 35.91% for sentiment analysis and up to 81.16% for natural language inference, using XLM-R, for a diverse set of target languages including Korean, Arabic, Urdu and Hindi. Apart from boosting cross-lingual transfer, adaptation using XITE also safeguards against forgetting and maintains task performance on the high-resource language.
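The matching and interpolation steps described above can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify the similarity measure, the interpolation weight, or which embeddings are mixed, so cosine similarity, the weight `lam`, and the helper names `cosine` and `match_and_interpolate` are all assumptions made for illustration. The LDA projection is omitted because the abstract does not say how it is fitted, and plain Python lists stand in for real sentence embeddings.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def match_and_interpolate(target_embs, source_embs, source_labels, lam=0.5):
    """Hypothetical helper: pair each unlabeled target embedding with its most
    similar labeled source embedding, adopt that label, and interpolate.

    lam is an assumed mixing weight: lam * source + (1 - lam) * target.
    """
    synthetic = []
    for t in target_embs:
        # Index of the labeled source example most similar to this target.
        best = max(range(len(source_embs)),
                   key=lambda i: cosine(t, source_embs[i]))
        s = source_embs[best]
        # Linear interpolation of the matched source and target embeddings.
        mixed = [lam * a + (1.0 - lam) * b for a, b in zip(s, t)]
        # The synthetic example carries the matched source label.
        synthetic.append((mixed, source_labels[best]))
    return synthetic


# Toy data: three labeled "English" embeddings, two unlabeled target embeddings.
source_embs = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
source_labels = ["positive", "negative", "neutral"]
target_embs = [[0.9, 0.1], [0.2, 0.8]]

synthetic = match_and_interpolate(target_embs, source_embs, source_labels)
for vector, label in synthetic:
    print(label, vector)
```

In this toy setup, each synthetic pair would then be used as a training example for task-specific fine-tuning. With real data the embeddings would come from a multilingual encoder such as XLM-R, and the nearest-neighbor search would typically be vectorized.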