🤖 AI Summary
This study addresses the lack of reliable source language selection methods for cross-lingual transfer in low-resource African languages. Through a systematic evaluation of five embedding similarity metrics—including cosine distance, P@1, CSLS, and CKA—across 816 cross-lingual transfer experiments spanning 12 African languages, three NLP tasks, and three Africa-centric multilingual models, the work demonstrates that cosine distance and retrieval-based metrics (P@1, CSLS) effectively predict transfer performance (Spearman’s ρ = 0.4–0.6), matching the predictive power of URIEL typological features. In contrast, CKA exhibits negligible predictive ability (ρ ≈ 0.1). The paper further presents the first direct comparison between embedding-based metrics and linguistic typology, uncovering a Simpson’s paradox when aggregating results across models, thereby underscoring the necessity of validating metric efficacy separately for each model.
📝 Abstract
Cross-lingual transfer is essential for building NLP systems for low-resource African languages, but practitioners lack reliable methods for selecting source languages. We systematically evaluate five embedding similarity metrics across 816 transfer experiments spanning three NLP tasks, three African-centric multilingual models, and 12 languages from four language families. We find that cosine gap and retrieval-based metrics (P@1, CSLS) reliably predict transfer success ($\rho = 0.4$–$0.6$), while CKA shows negligible predictive power ($\rho \approx 0.1$). Critically, correlation signs reverse when pooling across models (Simpson's Paradox), so practitioners must validate per-model. Embedding metrics achieve comparable predictive power to URIEL linguistic typology. Our results provide concrete guidance for source language selection and highlight the importance of model-specific analysis.
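The evaluation pattern described above—scoring candidate source languages by an embedding similarity metric, then correlating those scores with observed transfer performance via Spearman's ρ—can be sketched as follows. This is a minimal toy illustration, not the paper's pipeline: the language codes, embedding dimensionality, and transfer scores are all hypothetical placeholders.

```python
# Toy sketch: rank candidate source languages by cosine similarity of
# (hypothetical) mean-pooled language embeddings, then measure how well
# that ranking predicts (hypothetical) downstream transfer scores.
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(0)

# Hypothetical mean-pooled representations per language (dim = 64).
langs = ["swa", "hau", "yor", "amh", "zul", "ibo"]
emb = {lang: rng.normal(size=64) for lang in langs}

def cosine_sim(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

target = "swa"
sources = [lang for lang in langs if lang != target]

# Similarity of each candidate source language to the target.
sims = np.array([cosine_sim(emb[target], emb[s]) for s in sources])

# Placeholder transfer scores (e.g., task F1 after fine-tuning on each
# source and evaluating on the target) -- random here for illustration.
transfer_scores = rng.uniform(0.4, 0.8, size=len(sources))

# Spearman's rho: does the similarity ranking predict transfer success?
rho, pval = spearmanr(sims, transfer_scores)
print(f"Spearman rho = {rho:.2f} (p = {pval:.2f})")
```

In practice one would repeat this per model and per task, since, as the abstract notes, correlation signs can reverse when results are pooled across models.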