MGSM-Pro: A Simple Strategy for Robust Multilingual Mathematical Reasoning Evaluation

📅 2026-01-29
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing multilingual mathematical reasoning benchmarks lack sufficient difficulty and diversity in low-resource languages, limiting their ability to evaluate model robustness to variations in numerical instances. To address this gap, this work extends the MGSM dataset by generating five variants per problem using the GSM-Symbolic approach, each incorporating distinct names, numbers, and irrelevant contextual details. This yields the first multilingual robustness evaluation benchmark spanning nine languages and introduces a new evaluation paradigm requiring models to be assessed across at least five such variants for more reliable performance measurement. Experimental results reveal that Gemini 2.5 Flash and GPT-4.1 are highly sensitive to numerical perturbations, whereas Claude 4.0 Sonnet, GPT-OSS 120B, and DeepSeek V3 demonstrate greater robustness. Notably, performance significantly degrades in low-resource languages across all models.
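The summary's evaluation paradigm — scoring a model on at least five variants of each problem rather than one — amounts to averaging correctness per problem over its variants before aggregating. A minimal sketch of that aggregation, with hypothetical correctness records (the problem IDs and 0/1 scores below are illustrative, not results from the paper):

```python
from statistics import mean

# Hypothetical per-variant correctness: 1/0 scores for a model on five
# instantiations of each of three problems (illustrative numbers only).
scores = {
    "q1": [1, 1, 0, 1, 1],
    "q2": [1, 0, 0, 1, 0],
    "q3": [1, 1, 1, 1, 1],
}

# Each problem counts once, averaged over its variants, so a model that
# solves only the original instantiation is not over-credited.
per_problem = {q: mean(s) for q, s in scores.items()}
overall = mean(per_problem.values())  # (0.8 + 0.4 + 1.0) / 3
```

Reporting the per-problem spread alongside `overall` also exposes the variance to digit instantiation that the paper measures.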

📝 Abstract
Large language models have made substantial progress in mathematical reasoning. However, benchmark development for multilingual evaluation has lagged behind English in both difficulty and recency. Recently, GSM-Symbolic showed strong evidence of high variance when models are evaluated on different instantiations of the same question; however, that evaluation was conducted only in English. In this paper, we introduce MGSM-Pro, an extension of the MGSM dataset built with the GSM-Symbolic approach. Our dataset provides five instantiations per MGSM question by varying names, digits, and irrelevant context. Evaluations across nine languages reveal that many low-resource languages suffer large performance drops when tested on digit instantiations different from those in the original test set. We further find that some proprietary models, notably Gemini 2.5 Flash and GPT-4.1, are less robust to digit instantiation, whereas Claude 4.0 Sonnet is more robust. Among open models, GPT-OSS 120B and DeepSeek V3 show stronger robustness. Based on these findings, we recommend evaluating each problem using at least five digit-varying instantiations to obtain a more robust and realistic assessment of math reasoning.
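The variant-generation step described in the abstract — new names and digits per instantiation, with the ground-truth answer recomputed — can be sketched as below. The template, name list, and seeding scheme are hypothetical stand-ins; MGSM-Pro's actual templates follow GSM-Symbolic and are not reproduced here:

```python
import random

# Hypothetical GSM-style template with name and digit slots.
TEMPLATE = "{name} has {a} apples and buys {b} more. How many apples does {name} have now?"
NAMES = ["Amina", "Kwame", "Sofia", "Ravi", "Mei"]

def instantiate(seed: int):
    """Produce one variant: fresh name and digits, answer recomputed."""
    rng = random.Random(seed)  # seeded so each variant is reproducible
    name = rng.choice(NAMES)
    a, b = rng.randint(2, 50), rng.randint(2, 50)
    question = TEMPLATE.format(name=name, a=a, b=b)
    return question, a + b  # ground truth tracks the sampled digits

# Five instantiations per problem, as the paper recommends.
variants = [instantiate(seed) for seed in range(5)]
```

Irrelevant-context perturbations would add distractor sentences to the template without touching the answer computation.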
Problem

Research questions and friction points this paper is trying to address.

multilingual mathematical reasoning
evaluation robustness
digit instantiation
low-resource languages
benchmarking
Innovation

Methods, ideas, or system contributions that make the work stand out.

multilingual mathematical reasoning
robustness evaluation
digit instantiation
MGSM-Pro
language model benchmarking
Tianyi Xu
Tulane University
Reinforcement Learning, Network Optimization, Statistics, NLP (LLMs), Operations Research
Kosei Uemura
Mila-Quebec AI Institute, University of Toronto
A. Kondoro
Hanyang University, Rep. of Korea, Masakhane
Tadesse Destaw Belay
Ph.D. candidate IPN, Mexico
NLP for Low-resource Languages, Machine Learning, and LLMs
Catherine Nana Nyaah Essuman
Umbaji, Masakhane
Ifeoma Okoh
Masakhane
Ganiyat Afolabi
University of Ibadan, Nigeria, McPherson University, Nigeria, Masakhane
Ayodele Awokoya
McPherson University, University of Ibadan
Machine Learning, Natural Language Processing, Machine Translation, Computational Linguistics
D. I. Adelani
McGill University, Mila-Quebec AI Institute, Canada CIFAR AI Chair