🤖 AI Summary
Existing mathematical reasoning benchmarks are heavily biased toward English and high-resource languages, lacking systematic evaluation of mathematical competence across multilingual—especially low-resource—settings.
Method: We introduce MathMist, the first parallel multilingual mathematical reasoning benchmark covering seven languages, comprising over 21,000 human-verified and automatically aligned question-answer pairs. We conduct systematic zero-shot, chain-of-thought, and code-switching evaluations of mainstream large language models (LLMs).
Contribution/Results: MathMist enables cross-lingual comparability across high-, medium-, and low-resource languages for the first time. Our evaluation reveals a substantial degradation (30–50% average drop) in LLMs' mathematical reasoning performance on low-resource languages, exposing critical challenges in linguistic fairness and reasoning consistency. The benchmark provides a standardized, reproducible evaluation infrastructure to advance research in multilingual mathematical reasoning.
📝 Abstract
Mathematical reasoning remains one of the most challenging domains for large language models (LLMs), requiring not only linguistic understanding but also structured logical deduction and numerical precision. While recent LLMs demonstrate strong general-purpose reasoning abilities, their mathematical competence across diverse languages remains underexplored. Existing benchmarks primarily focus on English or a narrow subset of high-resource languages, leaving significant gaps in assessing multilingual and cross-lingual mathematical reasoning. To address this, we introduce MathMist, a parallel multilingual benchmark for mathematical problem solving and reasoning. MathMist encompasses over 21K aligned question-answer pairs across seven languages, providing balanced coverage of high-, medium-, and low-resource linguistic settings. The dataset captures linguistic variety, multiple problem types, and solution-synthesis capabilities. We systematically evaluate a diverse suite of models, including open-source small and medium LLMs, proprietary systems, and multilingual-reasoning-focused models, under zero-shot, chain-of-thought (CoT), and code-switched reasoning paradigms. Our results reveal persistent deficiencies in LLMs' ability to perform consistent and interpretable mathematical reasoning across languages, with pronounced degradation in low-resource settings. All code and data are available on GitHub: https://github.com/mahbubhimel/MathMist
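The three prompting paradigms named in the abstract can be illustrated with minimal prompt templates. This is a hedged sketch only: the function name, template wording, and sample question are assumptions for illustration, not the actual prompts or code from the MathMist repository.

```python
# Illustrative sketch of the three evaluation paradigms described in the
# abstract: zero-shot, chain-of-thought (CoT), and code-switched prompting.
# Template wording is hypothetical, not taken from the MathMist codebase.

def build_prompt(question: str, paradigm: str, pivot_language: str = "English") -> str:
    """Return an evaluation prompt for one multilingual math question."""
    if paradigm == "zero-shot":
        # Direct answering, no reasoning trace requested.
        return f"Solve the following problem. Give only the final answer.\n\n{question}"
    if paradigm == "cot":
        # Elicit an explicit step-by-step reasoning trace before the answer.
        return (
            "Solve the following problem. Think step by step, "
            f"then state the final answer.\n\n{question}"
        )
    if paradigm == "code-switched":
        # Keep the question in its source language but ask the model to
        # reason in a pivot language (e.g., English).
        return (
            f"Solve the following problem. Reason in {pivot_language}, "
            f"then state the final answer.\n\n{question}"
        )
    raise ValueError(f"unknown paradigm: {paradigm}")

# Hypothetical low-resource-language sample (Bengali), for illustration only.
question = "সাতটি সংখ্যার গড় ১২ হলে তাদের যোগফল কত?"
for paradigm in ("zero-shot", "cot", "code-switched"):
    print(f"--- {paradigm} ---")
    print(build_prompt(question, paradigm))
```

Holding the question fixed while varying only the instruction wording is what makes the paradigms comparable across languages: any accuracy gap between them can then be attributed to the reasoning regime rather than the problem content.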