MMLU-ProX: A Multilingual Benchmark for Advanced Large Language Model Evaluation

πŸ“… 2025-03-13
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ€– AI Summary
To address the limited ability of existing benchmarks to evaluate models in multilingual and culturally diverse settings, this paper introduces MMLU-ProX, a high-difficulty multilingual benchmark explicitly designed to assess reasoning across 13 typologically diverse languages (β‰ˆ11,829 items per language). It employs a semi-automatic translation pipeline with domain-expert validation to ensure conceptual fidelity, terminological consistency, and cultural appropriateness. A systematic evaluation of 25 state-of-the-art large language models reveals a substantial performance gap: mainstream models degrade severely on low-resource languages (e.g., β‰ˆ40% accuracy on Swahili), well below their English performance (>70%). This quantifies the current multilingual capability gap and positions MMLU-ProX as a standard for fair, robust, and linguistically grounded multilingual model evaluation.

πŸ“ Abstract
Traditional benchmarks struggle to evaluate increasingly sophisticated language models in multilingual and culturally diverse contexts. To address this gap, we introduce MMLU-ProX, a comprehensive multilingual benchmark covering 13 typologically diverse languages with approximately 11,829 questions per language. Building on the challenging reasoning-focused design of MMLU-Pro, our framework employs a semi-automatic translation process: translations generated by state-of-the-art large language models (LLMs) are rigorously evaluated by expert annotators to ensure conceptual accuracy, terminological consistency, and cultural relevance. We comprehensively evaluate 25 state-of-the-art LLMs using 5-shot chain-of-thought (CoT) and zero-shot prompting strategies, analyzing their performance across linguistic and cultural boundaries. Our experiments reveal consistent performance degradation from high-resource languages to lower-resource ones, with the best models achieving over 70% accuracy on English but dropping to around 40% for languages like Swahili, highlighting persistent gaps in multilingual capabilities despite recent advances. MMLU-ProX is an ongoing project; we are expanding our benchmark by incorporating additional languages and evaluating more language models to provide a more comprehensive assessment of multilingual capabilities.
Problem

Research questions and friction points this paper is trying to address.

Existing benchmarks inadequately evaluate the multilingual capabilities of advanced language models.
Coverage of typologically diverse and low-resource languages remains sparse.
Model performance across linguistic and cultural boundaries is under-assessed.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Semi-automatic translation using LLMs
Expert evaluation for accuracy and relevance
5-shot chain-of-thought prompting strategy
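The 5-shot chain-of-thought evaluation described above can be sketched as a small prompt-construction and answer-extraction harness. This is an illustrative assumption, not the paper's released code: the function names, the prompt template, and the `item` dictionary shape are all hypothetical, and only the MMLU-Pro convention of up to ten lettered options (A–J) is taken from the source.

```python
import re

LETTERS = "ABCDEFGHIJ"  # MMLU-Pro-style items carry up to 10 options


def format_question(item: dict) -> str:
    """Render one multiple-choice item with lettered options."""
    lines = [f"Question: {item['question']}"]
    for letter, option in zip(LETTERS, item["options"]):
        lines.append(f"{letter}. {option}")
    return "\n".join(lines)


def build_cot_prompt(test_item: dict, exemplars: list) -> str:
    """Prepend k solved exemplars (k=5 in the paper) before the test item.

    Each exemplar carries a worked rationale ('cot') ending in its answer,
    nudging the model to reason step by step before committing to a choice.
    """
    parts = []
    for ex in exemplars:
        parts.append(format_question(ex))
        parts.append("Answer: Let's think step by step. "
                     f"{ex['cot']} The answer is ({ex['answer']}).")
    parts.append(format_question(test_item))
    parts.append("Answer: Let's think step by step.")
    return "\n\n".join(parts)


def extract_answer(completion: str):
    """Pull the last 'answer is (X)' choice out of a CoT completion."""
    matches = re.findall(r"answer is \(([A-J])\)", completion)
    return matches[-1] if matches else None
```

Zero-shot prompting, the other strategy evaluated, corresponds to calling `build_cot_prompt` with an empty exemplar list so only the test item and the CoT cue remain.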
πŸ”Ž Similar Papers
No similar papers found.
Weihao Xuan
The University of Tokyo, RIKEN
Natural Language Processing · Computer Vision · Multimodal AI · Generative AI · LLM Agent
Rui Yang
Duke-NUS Medical School
Heli Qi
Waseda University, RIKEN
Multi-Modal Learning
Qingcheng Zeng
PhD Student in NLP, Northwestern University
Computational Social Science · NLP · Computational Linguistics
Yunze Xiao
Language Technology Institute, Carnegie Mellon University
Natural Language Processing · Computational Social Science · Anthropomorphism
Yun Xing
School of Computer Science and Engineering, Nanyang Technological University
Computer Vision
Junjue Wang
The University of Tokyo
Huitao Li
Duke-NUS Medical School
Medical Informatics
Xin Li
Duke-NUS Medical School
Kunyu Yu
Duke-NUS Medical School
Nan Liu
Duke-NUS Medical School
Qingyu Chen
Biomedical Informatics & Data Science, Yale University; NCBI-NLM, National Institutes of Health
Text Mining · Machine Learning · Data Curation · BioNLP · Medical Imaging Analysis
Douglas Teodoro
Professor, University of Geneva
Biomedical NLP · Machine Learning for Healthcare · Medical Informatics
Edison Marrese-Taylor
National Institute of Advanced Industrial Science and Technology (AIST)
Natural Language Processing · Machine Learning
Shijian Lu
College of Computing and Data Science, NTU
Image and Video Analytics · Computer Vision · Machine Learning
Yusuke Iwasawa
The University of Tokyo
Deep Learning · Transfer Learning · Foundation Models · Meta-Learning
Yutaka Matsuo
The University of Tokyo
Irene Li
Project Lecturer, University of Tokyo
Large Language Models · Graph Neural Networks · BioNLP · Medical NLP · Text Summarization