IrokoBench: A New Benchmark for African Languages in the Age of Large Language Models

📅 2024-06-05
🏛️ arXiv.org
📈 Citations: 1
Influential: 0
🤖 AI Summary
This study addresses the evaluation of large language model (LLM) capabilities for low-resource African languages. We introduce IrokoBench, a human-translated, multi-task benchmark covering 17 typologically diverse African languages across three tasks: natural language inference, mathematical reasoning, and multiple-choice knowledge-based question answering. Using IrokoBench, we assess zero-shot, few-shot, and translate-test performance (where test sets are machine-translated into English) across 10 open and 6 proprietary LLMs. The evaluation reveals a substantial gap between high-resource languages (such as English and French) and low-resource African languages, as well as between open and proprietary models: the best open model, Gemma 2 27B, reaches only 63% of the performance of the best proprietary model, GPT-4o. Machine-translating the test set into English before evaluation helps close the gap for larger English-centric models such as Gemma 2 27B and LLaMA 3.1 70B.

📝 Abstract
Despite the widespread adoption of large language models (LLMs), their remarkable capabilities remain limited to a few high-resource languages. Additionally, many low-resource languages (e.g., African languages) are often evaluated only on basic text classification tasks due to the lack of appropriate or comprehensive benchmarks outside of high-resource languages. In this paper, we introduce IrokoBench -- a human-translated benchmark dataset for 17 typologically-diverse low-resource African languages covering three tasks: natural language inference (AfriXNLI), mathematical reasoning (AfriMGSM), and multi-choice knowledge-based question answering (AfriMMLU). We use IrokoBench to evaluate zero-shot, few-shot, and translate-test settings (where test sets are translated into English) across 10 open and six proprietary LLMs. Our evaluation reveals a significant performance gap between high-resource languages (such as English and French) and low-resource African languages, and between open and proprietary models, with the best-performing open model, Gemma 2 27B, reaching only 63% of the performance of the best-performing proprietary model, GPT-4o. In addition, machine-translating the test set to English before evaluation helped to close the gap for larger English-centric models, such as Gemma 2 27B and LLaMA 3.1 70B. These findings suggest that more efforts are needed to develop and adapt LLMs for African languages.
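The translate-test setting described in the abstract can be illustrated with a minimal sketch. The `translate_to_english` and `llm_answer` functions below are hypothetical stand-ins for a machine-translation system and an LLM call (they are not part of IrokoBench or any real API); the point is only the shape of the evaluation loop: in the direct setting the model sees the African-language question, while in the translate-test setting the question is first machine-translated into English.

```python
def translate_to_english(text: str) -> str:
    # Hypothetical MT step; a real pipeline would call an MT model here.
    toy_mt = {"Kini 2 + 2?": "What is 2 + 2?"}
    return toy_mt.get(text, text)

def llm_answer(prompt: str) -> str:
    # Hypothetical English-centric toy model: it only answers English prompts.
    return "4" if prompt == "What is 2 + 2?" else "?"

def evaluate(examples, translate_test: bool = False) -> float:
    """Accuracy under the direct or translate-test setting."""
    correct = 0
    for question, gold in examples:
        if translate_test:
            # Translate-test: evaluate the model on an English translation.
            question = translate_to_english(question)
        if llm_answer(question) == gold:
            correct += 1
    return correct / len(examples)

examples = [("Kini 2 + 2?", "4")]  # toy Yoruba-style math item
print(evaluate(examples))                       # direct setting
print(evaluate(examples, translate_test=True))  # translate-test setting
```

Under these toy stand-ins, the English-centric model fails in the direct setting but succeeds after translation, mirroring the paper's finding that translate-test helps English-centric models.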
Problem

Research questions and friction points this paper is trying to address.

African Languages
Large Language Models
Performance Improvement
Innovation

Methods, ideas, or system contributions that make the work stand out.

IrokoBench
African Languages
Model Performance Evaluation
David Ifeoluwa Adelani
McGill University and Mila - Quebec AI Institute and Canada CIFAR AI Chair
Natural language processing, Multilinguality, Multilingual NLP, AfricaNLP, Low-resource NLP
Jessica Ojo
Lelapa AI, Masakhane NLP
Israel Abebe Azime
Saarland University
NLP | Multimodal learning | Deep Learning Applications
Zhuang Yun Jian
University of Toronto
Jesujoba Oluwadara Alabi
Saarland University
Natural Language Processing, Neural Machine Translation, Machine Learning, Information Extraction
Xuanli He
UCL
Natural Language Processing, AI Safety, Machine Learning
Millicent Ochieng
Microsoft
Natural Language Processing, Machine Learning, Artificial Intelligence
Sara Hooker
Head of Cohere For AI
Machine learning efficiency, robustness, interpretability, trustworthy ML
Andiswa Bukula
SADiLaR, Masakhane NLP
En-Shiun Annie Lee
Ontario Tech University, and University of Toronto (Status-Only)
Natural Language Processing, Data Mining, Pattern Analysis
Chiamaka Chukwuneke
Lancaster University
Happy Buzaaba
Princeton University
Blessing K. Sibanda
Masakhane NLP
Godson Kalipe
Masakhane NLP
Jonathan Mukiibi
Makerere University, Masakhane NLP
Salomon Kabongo
Leibniz Universität Hannover, Masakhane NLP
Foutse Yuehgoh
Le CNAM, Masakhane NLP
M. Setaka
SADiLaR, Masakhane NLP
Lolwethu Ndolela
Masakhane NLP
N. Odu
Masakhane NLP
Rooweither Mabuya
SADiLaR, Masakhane NLP
Shamsuddeen Hassan Muhammad
Bayero University, Kano, & Google DeepMind Academic Fellow at Imperial College London
Natural Language Processing, Sentiment Analysis, AfricaNLP, Low-resource NLP, Multilinguality
Salomey Osei
University of Deusto
Machine Learning, NLP, AutoML
Sokhar Samb
DAUST, Masakhane NLP
Tadesse Kebede Guge
Haramaya University, Masakhane NLP
Pontus Stenetorp
University College London, Masakhane NLP