🤖 AI Summary
This study addresses a critical limitation of existing cross-lingual transfer approaches: comparisons of source language selection strategies do not control for total training data, conflating the effects of language choice and data volume and thereby hindering effective support for low-resource African languages in NLP tasks. To resolve this, the work formulates multi-source cross-lingual transfer as a resource allocation problem under a fixed annotation budget, jointly optimizing which source languages to select and how much data to allocate from each. Using mBERT and XLM-R, the authors conduct 288 experiments on Hausa, Yoruba, and Swahili, systematically evaluating four allocation strategies on named entity recognition (NER) and sentiment analysis. Results demonstrate that multi-source transfer significantly outperforms single-source transfer (Cohen's d = 0.80–1.98), that differences among allocation strategies are marginal, and that the efficacy of embedding similarity as a source-selection proxy is task-dependent: random selection excels in NER, whereas similarity-based selection performs better in sentiment analysis.
📝 Abstract
Cross-lingual transfer learning enables NLP for low-resource languages by leveraging labeled data from higher-resource sources, yet existing comparisons of source language selection strategies do not control for total training data, confounding language selection effects with data quantity effects. We introduce Budget-Xfer, a framework that formulates multi-source cross-lingual transfer as a budget-constrained resource allocation problem. Given a fixed annotation budget B, our framework jointly optimizes which source languages to include and how much data to allocate from each. We evaluate four allocation strategies across named entity recognition and sentiment analysis for three African target languages (Hausa, Yoruba, Swahili) using two multilingual models, conducting 288 experiments. Our results show that (1) multi-source transfer significantly outperforms single-source transfer (Cohen's d = 0.80 to 1.98), driven by a structural budget underutilization bottleneck; (2) among multi-source strategies, differences are modest and non-significant; and (3) the value of embedding similarity as a selection proxy is task-dependent, with random selection outperforming similarity-based selection for NER but not sentiment analysis.
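The budget-constrained allocation idea can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function name, the "equal" and "proportional" strategy labels, and the similarity scores in the example are all hypothetical assumptions; the paper's actual strategies and scores may differ.

```python
import random

def allocate_budget(budget, candidates, similarity=None, strategy="equal", k=3):
    """Return {language: n_examples} summing to at most `budget`.

    Hypothetical sketch: pick k source languages (by similarity to the
    target if scores are given, otherwise at random), then split the
    annotation budget B among them either evenly ("equal") or weighted
    by similarity ("proportional").
    """
    if similarity:
        chosen = sorted(candidates, key=lambda lang: -similarity[lang])[:k]
    else:
        chosen = random.sample(candidates, k)
    if strategy == "proportional" and similarity:
        total = sum(similarity[lang] for lang in chosen)
        return {lang: int(budget * similarity[lang] / total) for lang in chosen}
    return {lang: budget // len(chosen) for lang in chosen}

# Illustrative only: 10k-example budget split over the 3 most similar sources.
sims = {"amh": 0.42, "ibo": 0.61, "kin": 0.55, "pcm": 0.37, "wol": 0.48}
plan = allocate_budget(10_000, list(sims), similarity=sims, strategy="proportional")
```

Random selection corresponds to dropping the `similarity` argument, which is the variant the paper finds competitive for NER.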