MMLU-ProX: A Multilingual Benchmark for Advanced Large Language Model Evaluation

πŸ“… 2025-03-13
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ€– AI Summary
To address the limited ability of existing benchmarks to evaluate models in multilingual and culturally diverse settings, this paper introduces MMLU-ProX, a high-difficulty multilingual benchmark explicitly designed to assess reasoning across 13 typologically diverse languages (β‰ˆ11,829 items per language). It employs a semi-automatic translation pipeline with domain-expert validation to ensure conceptual fidelity, terminological consistency, and cultural appropriateness. A systematic evaluation of 25 state-of-the-art large language models reveals a substantial performance gap: mainstream models degrade severely on low-resource languages (e.g., β‰ˆ40% accuracy on Swahili), well below their English performance (>70%). This quantifies the current multilingual capability gap and positions MMLU-ProX as a standard for fair, robust, and linguistically grounded multilingual model evaluation.

πŸ“ Abstract
Traditional benchmarks struggle to evaluate increasingly sophisticated language models in multilingual and culturally diverse contexts. To address this gap, we introduce MMLU-ProX, a comprehensive multilingual benchmark covering 13 typologically diverse languages with approximately 11,829 questions per language. Building on the challenging reasoning-focused design of MMLU-Pro, our framework employs a semi-automatic translation process: translations generated by state-of-the-art large language models (LLMs) are rigorously evaluated by expert annotators to ensure conceptual accuracy, terminological consistency, and cultural relevance. We comprehensively evaluate 25 state-of-the-art LLMs using 5-shot chain-of-thought (CoT) and zero-shot prompting strategies, analyzing their performance across linguistic and cultural boundaries. Our experiments reveal consistent performance degradation from high-resource languages to lower-resource ones, with the best models achieving over 70% accuracy on English but dropping to around 40% for languages like Swahili, highlighting persistent gaps in multilingual capabilities despite recent advances. MMLU-ProX is an ongoing project; we are expanding our benchmark by incorporating additional languages and evaluating more language models to provide a more comprehensive assessment of multilingual capabilities.
Problem

Research questions and friction points this paper is trying to address.

Existing benchmarks inadequately evaluate the multilingual capabilities of advanced language models.
Coverage of typologically diverse and low-resource languages remains sparse.
Model performance across linguistic and cultural boundaries is under-assessed.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Semi-automatic translation using LLMs
Expert evaluation for accuracy and relevance
5-shot chain-of-thought prompting strategy
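The 5-shot chain-of-thought evaluation described above can be sketched as a small prompt-construction and answer-extraction harness. This is an illustrative assumption, not the paper's released code: the function names, the prompt template, and the `item` dictionary shape are all hypothetical, and only the MMLU-Pro convention of up to ten lettered options (A–J) is taken from the source.

```python
import re

LETTERS = "ABCDEFGHIJ"  # MMLU-Pro-style items carry up to 10 options


def format_question(item: dict) -> str:
    """Render one multiple-choice item with lettered options."""
    lines = [f"Question: {item['question']}"]
    for letter, option in zip(LETTERS, item["options"]):
        lines.append(f"{letter}. {option}")
    return "\n".join(lines)


def build_cot_prompt(test_item: dict, exemplars: list) -> str:
    """Prepend k solved exemplars (k=5 in the paper) before the test item.

    Each exemplar carries a worked rationale ('cot') ending in its answer,
    nudging the model to reason step by step before committing to a choice.
    """
    parts = []
    for ex in exemplars:
        parts.append(format_question(ex))
        parts.append("Answer: Let's think step by step. "
                     f"{ex['cot']} The answer is ({ex['answer']}).")
    parts.append(format_question(test_item))
    parts.append("Answer: Let's think step by step.")
    return "\n\n".join(parts)


def extract_answer(completion: str):
    """Pull the last 'answer is (X)' choice out of a CoT completion."""
    matches = re.findall(r"answer is \(([A-J])\)", completion)
    return matches[-1] if matches else None
```

Zero-shot prompting, the other strategy evaluated, corresponds to calling `build_cot_prompt` with an empty exemplar list so only the test item and the CoT cue remain.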
πŸ”Ž Similar Papers
No similar papers found.
Weihao Xuan
The University of Tokyo, RIKEN
Natural Language Processing · Computer Vision · Multimodal AI · Generative AI · LLM Agent
Rui Yang
Duke-NUS Medical School
Heli Qi
Waseda University, RIKEN
Multi-Modal Learning
Qingcheng Zeng
PhD Student in NLP, Northwestern University
Computational Social Science · NLP · Computational Linguistics
Yunze Xiao
Language Technology Institute, Carnegie Mellon University
Natural Language Processing · Computational Social Science · Anthropomorphism
Yun Xing
School of Computer Science and Engineering, Nanyang Technological University
Computer Vision
Junjue Wang
The University of Tokyo
Huitao Li
Duke-NUS Medical School
Medical Informatics
Xin Li
Duke-NUS Medical School
Kunyu Yu
Duke-NUS Medical School
Nan Liu
Duke-NUS Medical School
Qingyu Chen
Biomedical Informatics & Data Science, Yale University; NCBI-NLM, National Institutes of Health
Text Mining · Machine Learning · Data Curation · BioNLP · Medical Imaging Analysis
Douglas Teodoro
Professor, University of Geneva
Biomedical NLP · Machine Learning for Healthcare · Medical Informatics
Edison Marrese-Taylor
National Institute of Advanced Industrial Science and Technology (AIST)
Natural Language Processing · Machine Learning
Shijian Lu
College of Computing and Data Science, NTU
Image and Video Analytics · Computer Vision · Machine Learning
Yusuke Iwasawa
The University of Tokyo
Deep Learning · Transfer Learning · Foundation Models · Meta-Learning
Yutaka Matsuo
The University of Tokyo
Irene Li
Project Lecturer, University of Tokyo
Large Language Models · Graph Neural Networks · BioNLP · Medical NLP · Text Summarization