XITE: Cross-lingual Interpolation for Transfer using Embeddings

📅 2026-04-26

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the poor cross-lingual transfer of multilingual language models to low-resource languages by proposing XITE, an embedding-based data augmentation method. XITE first matches unlabeled text in a low-resource target language with a labeled English example from a task-specific training corpus, based on embedding similarity, and adopts that example's label. It then interpolates the source and target embeddings to create synthetic training data for task-specific fine-tuning. Projecting the target text into a language-rich subspace with linear discriminant analysis (LDA) before interpolation further boosts performance. Using XLM-R, XITE achieves gains of up to 35.91% on sentiment analysis and up to 81.16% on natural language inference for target languages including Korean, Arabic, Urdu, and Hindi, while safeguarding against forgetting and maintaining task performance on the high-resource language.
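
The interpolation step can be written compactly. The following is only a sketch: the summary says "interpolation" without giving a formula, so the linear form and the mixing coefficient $\lambda$ are assumptions made here for illustration, as are the symbol names.

```latex
% Assumed form: linear interpolation between a matched pair of embeddings.
% e_src: embedding of the labeled English example (label y_src)
% e_tgt: embedding of the unlabeled target-language text
% lambda: hypothetical mixing weight, not specified in the source
\tilde{\mathbf{e}} = \lambda\,\mathbf{e}_{\mathrm{src}} + (1-\lambda)\,\mathbf{e}_{\mathrm{tgt}},
\qquad \lambda \in [0, 1],
\qquad \tilde{y} = y_{\mathrm{src}}
```

Under this assumed form, $\lambda = 0.5$ with $\mathbf{e}_{\mathrm{src}} = (1, 0)$ and $\mathbf{e}_{\mathrm{tgt}} = (0.8, 0.2)$ gives $\tilde{\mathbf{e}} = (0.9, 0.1)$, which inherits the label of the English example.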

📝 Abstract

Facilitating cross-lingual transfer in multilingual language models remains a critical challenge. Towards this goal, we propose an embedding-based data augmentation technique called XITE. We start with unlabeled text from a low-resource target language, identify an English counterpart in a task-specific training corpus using embedding-based similarities and adopt its label. Next, we perform a simple interpolation of the source and target embeddings to create synthetic data for task-specific fine-tuning. Projecting the target text into a language-rich subspace using linear discriminant analysis (LDA), prior to interpolation, further boosts performance. Our cross-lingual embedding-based augmentation technique XITE yields significant improvements of up to 35.91% for sentiment analysis and up to 81.16% for natural language inference, using XLM-R, for a diverse set of target languages including Korean, Arabic, Urdu and Hindi. Apart from boosting cross-lingual transfer, adaptation using XITE also safeguards against forgetting and maintains task performance on the high-resource language.
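
The matching and interpolation steps described in the abstract can be sketched in a few lines. This is a minimal illustration rather than the authors' implementation: it assumes cosine similarity as the embedding-based similarity, a plain linear mix with a hypothetical weight `lam`, and toy 2-dimensional vectors in place of real encoder outputs. The function names are invented for this example, and the LDA projection is omitted.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def match_and_interpolate(target_vecs, source_vecs, source_labels, lam=0.5):
    """Build synthetic (embedding, label) pairs.

    For each unlabeled target embedding, pick the most similar labeled
    source embedding, adopt its label, and mix the two vectors.
    `lam` is an assumed mixing weight: 1.0 keeps only the source vector,
    0.0 keeps only the target vector.
    """
    synthetic = []
    for t in target_vecs:
        # Index of the nearest labeled source example by cosine similarity.
        best = max(range(len(source_vecs)),
                   key=lambda i: cosine(t, source_vecs[i]))
        s = source_vecs[best]
        mixed = [lam * a + (1.0 - lam) * b for a, b in zip(s, t)]
        synthetic.append((mixed, source_labels[best]))
    return synthetic


# Toy embeddings standing in for sentence-encoder outputs (assumption).
source_vecs = [[1.0, 0.0], [0.0, 1.0]]
source_labels = ["positive", "negative"]
target_vecs = [[0.8, 0.2], [0.1, 0.9]]

synthetic = match_and_interpolate(target_vecs, source_vecs, source_labels)
for vec, label in synthetic:
    print(label, vec)
```

In a real setup the vectors would come from a multilingual encoder such as XLM-R, and the synthetic pairs would feed task-specific fine-tuning. How the embeddings are extracted and where they enter the model are not stated in the abstract.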

Problem

Research questions and friction points this paper is trying to address.

cross-lingual transfer

low-resource languages

multilingual language models

data augmentation

Innovation

Methods, ideas, or system contributions that make the work stand out.

cross-lingual transfer

embedding interpolation

data augmentation

linear discriminant analysis