AfroBench: How Good are Large Language Models on African Languages?

📅 2023-11-14
📈 Citations: 15
Influential: 1
🤖 AI Summary
This study addresses two critical gaps in large language model (LLM) evaluation for African languages: the absence of standardized benchmarks and the scarcity of high-quality, accessible evaluation data. To this end, we introduce AfroBench, a comprehensive multilingual evaluation benchmark covering 64 African languages, 15 diverse NLP tasks, and 22 curated datasets. Methodologically, we establish a unified preprocessing and evaluation protocol supporting zero-shot, few-shot, and fine-tuning paradigms, and integrate both conventional fine-tuned BERT/T5 baselines and modern prompted LLMs for fair comparison. Our key contributions are threefold: (1) the first large-scale, systematic, cross-task evaluation framework for LLMs on African languages; (2) empirical confirmation of a strong positive correlation between the availability of language resources and model performance, identifying data scarcity as the primary bottleneck; and (3) evidence that state-of-the-art LLMs perform substantially worse on natural language understanding, text generation, question answering, and mathematical reasoning in African languages than in English.
📝 Abstract
Large-scale multilingual evaluations, such as MEGA, often include only a handful of African languages due to the scarcity of high-quality evaluation data and the limited discoverability of existing African datasets. This lack of representation hinders comprehensive LLM evaluation across a diverse range of languages and tasks. To address these challenges, we introduce AfroBench -- a multi-task benchmark for evaluating the performance of LLMs across 64 African languages, 15 tasks and 22 datasets. AfroBench consists of nine natural language understanding datasets, six text generation datasets, six knowledge and question answering tasks, and one mathematical reasoning task. We present results comparing the performance of prompting LLMs to fine-tuned baselines based on BERT and T5-style models. Our results suggest large gaps in performance between high-resource languages, such as English, and African languages across most tasks; but performance also varies based on the availability of monolingual data resources. Our findings confirm that performance on African languages continues to remain a hurdle for current LLMs, underscoring the need for additional efforts to close this gap. https://mcgill-nlp.github.io/AfroBench/
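To make the prompting setup concrete, below is a minimal sketch of how a zero-shot classification evaluation of this kind can be run. The prompt template, label set, model name, and toy examples are illustrative assumptions, not AfroBench's actual prompts or evaluation harness.

```python
# A minimal zero-shot prompting sketch in the spirit of AfroBench's LLM evaluation.
# The prompt template, label set, model, and examples are illustrative assumptions,
# not the paper's actual prompts or harness.
from transformers import pipeline

LABELS = ["positive", "negative", "neutral"]  # e.g. a sentiment-classification task

def build_prompt(text: str, language: str) -> str:
    # Simple instruction-style prompt; AfroBench's real templates may differ.
    return (
        f"Classify the sentiment of the following {language} sentence as "
        f"positive, negative, or neutral.\nSentence: {text}\nSentiment:"
    )

def zero_shot_accuracy(examples, language, model_name="bigscience/bloomz-560m"):
    """examples: list of (text, gold_label) pairs; returns simple accuracy."""
    generator = pipeline("text-generation", model=model_name)
    correct = 0
    for text, gold in examples:
        output = generator(build_prompt(text, language), max_new_tokens=5)
        completion = output[0]["generated_text"].split("Sentiment:")[-1].strip().lower()
        # Map the free-form completion onto the label set; default to "neutral".
        predicted = next((label for label in LABELS if label in completion), "neutral")
        correct += int(predicted == gold)
    return correct / len(examples)

# Usage (toy Swahili examples; labels are illustrative):
# acc = zero_shot_accuracy(
#     [("Chakula hiki ni kitamu sana.", "positive"),
#      ("Huduma ilikuwa mbaya sana.", "negative")],
#     language="Swahili",
# )
```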
Problem

Research questions and friction points this paper is trying to address.

How well do current LLMs perform across African languages and tasks?
High-quality African-language evaluation data is scarce and hard to discover
Performance gaps between English and African languages are poorly quantified
Innovation

Methods, ideas, or system contributions that make the work stand out.

AfroBench: a multi-task benchmark spanning 15 tasks and 22 datasets
Covers 64 African languages
Compares prompted LLMs with fine-tuned BERT/T5-style baselines (see the fine-tuning sketch below)
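For contrast with prompting, here is a minimal sketch of the kind of fine-tuned encoder baseline the benchmark compares against. The model id refers to a real multilingual encoder for African languages, but the dataset id, column names, language configuration, and hyperparameters are placeholders chosen for illustration, not the paper's exact setup.

```python
# Minimal fine-tuning sketch for a BERT-style baseline on an African-language
# classification task. Dataset id, column names, and hyperparameters are
# placeholder assumptions, not AfroBench's exact configuration.
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

MODEL_NAME = "Davlan/afro-xlmr-base"   # a multilingual encoder covering many African languages
DATASET_ID = "masakhane/afrisenti"     # hypothetical dataset id, used here only for illustration

def finetune(language_config: str, num_labels: int = 3):
    # Assumes the dataset has train/validation splits with "text" and "label" columns.
    dataset = load_dataset(DATASET_ID, language_config)
    tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
    model = AutoModelForSequenceClassification.from_pretrained(
        MODEL_NAME, num_labels=num_labels)

    def tokenize(batch):
        return tokenizer(batch["text"], truncation=True,
                         padding="max_length", max_length=128)

    tokenized = dataset.map(tokenize, batched=True)
    args = TrainingArguments(output_dir="baseline_out",
                             num_train_epochs=3,
                             per_device_train_batch_size=16)
    trainer = Trainer(model=model, args=args,
                      train_dataset=tokenized["train"],
                      eval_dataset=tokenized["validation"])
    trainer.train()
    return trainer.evaluate()   # add compute_metrics for accuracy/F1 as needed

# Usage: metrics = finetune("swa")   # e.g. a Swahili configuration
```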
Authors
Jessica Ojo
Masakhane, Lelapa AI, South Africa
Kelechi Ogueji
Masakhane
Pontus Stenetorp
University College London, United Kingdom
David Ifeoluwa Adelani
McGill University, Mila - Quebec AI Institute, and Canada CIFAR AI Chair
Natural language processing · Multilinguality · Multilingual NLP · AfricaNLP · Low-resource NLP