AfriMTEB and AfriE5: Benchmarking and Adapting Text Embedding Models for African Languages

📅 2025-10-27

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

African languages remain underrepresented in text embedding research, with scarce training data and evaluation resources. To address this, the paper introduces AfriMTEB, a regional expansion of MMTEB for African-language text embeddings that covers 59 languages, 14 tasks, and 38 datasets, including newly added tasks such as hate speech detection, intent detection, and emotion classification. It also presents AfriE5, an adaptation of the instruction-tuned mE5 model to African languages through cross-lingual contrastive distillation. In the authors' evaluation, AfriE5 achieves state-of-the-art performance, outperforming strong baselines such as Gemini-Embeddings and mE5. Together, AfriMTEB and AfriE5 provide a benchmark and an adapted model for advancing NLP research and applications across African languages.

📝 Abstract

Text embeddings are an essential building block for many NLP tasks, such as retrieval-augmented generation, which is crucial for reducing hallucinations in LLMs. Despite the recent release of the massively multilingual MTEB (MMTEB), African languages remain underrepresented, with existing tasks often repurposed from translation benchmarks such as FLORES clustering or SIB-200. In this paper, we introduce AfriMTEB, a regional expansion of MMTEB covering 59 languages, 14 tasks, and 38 datasets, including six newly added datasets. Unlike many MMTEB datasets that include fewer than five languages, the new additions span 14 to 56 African languages and introduce entirely new tasks, such as hate speech detection, intent detection, and emotion classification, which were not previously covered. Complementing this, we present AfriE5, an adaptation of the instruction-tuned mE5 model to African languages through cross-lingual contrastive distillation. Our evaluation shows that AfriE5 achieves state-of-the-art performance, outperforming strong baselines such as Gemini-Embeddings and mE5.
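
The abstract names cross-lingual contrastive distillation but does not spell out the training objective. The sketch below shows one common way such a loss is formulated, under stated assumptions: a teacher scores each query against every in-batch document on source-language text, and the student, which embeds translated versions of the same texts, is trained so that its softmax over cosine similarities matches the teacher's distribution. The function names, the temperature value, and the use of a KL divergence are illustrative choices, not details confirmed by the paper.

```python
import numpy as np


def l2_normalize(x):
    """Scale each row to unit length so dot products become cosine similarities."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)


def log_softmax(logits):
    """Numerically stable row-wise log-softmax."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))


def distillation_loss(student_q, student_d, teacher_logits, temperature=0.05):
    """Mean KL(teacher || student) over in-batch candidates.

    student_q, student_d: (batch, dim) student embeddings of translated
        queries and documents (hypothetical inputs).
    teacher_logits: (batch, batch) teacher relevance logits for the same
        pairs, computed on the source-language text (assumed setup).
    """
    student_logits = l2_normalize(student_q) @ l2_normalize(student_d).T
    student_logp = log_softmax(student_logits / temperature)
    teacher_logp = log_softmax(teacher_logits)
    teacher_p = np.exp(teacher_logp)
    kl_per_query = (teacher_p * (teacher_logp - student_logp)).sum(axis=1)
    return float(kl_per_query.mean())


# Toy usage with random vectors standing in for real model outputs.
rng = np.random.default_rng(0)
queries = rng.normal(size=(4, 8))
documents = rng.normal(size=(4, 8))
teacher = rng.normal(size=(4, 4))
loss = distillation_loss(queries, documents, teacher)
```

The loss is zero when the student reproduces the teacher's ranking distribution exactly and grows as the two diverge, which is what lets a student trained on low-resource text inherit the teacher's relevance judgments.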

Problem

Research questions and friction points this paper is trying to address.

Benchmarking text embedding models for underrepresented African languages

Introducing AfriMTEB with 59 languages, 14 tasks, and 38 datasets, six of them newly added

Adapting instruction-tuned models via cross-lingual contrastive distillation

Innovation

Methods, ideas, or system contributions that make the work stand out.

Expanded multilingual benchmark for African languages

Adapted embedding model via cross-lingual contrastive distillation

Introduced new tasks such as hate speech detection, intent detection, and emotion classification
