PolyMath: Evaluating Mathematical Reasoning in Multilingual Contexts

📅 2025-04-25
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Evaluating and improving multilingual large language models (LLMs) on mathematical reasoning remains challenging due to the lack of rigorous, fine-grained benchmarks spanning diverse languages and difficulty levels. Method: We introduce PolyMath, a high-quality benchmark covering 18 languages and four difficulty tiers. We systematically identify three key phenomena in multilingual mathematical reasoning: (i) substantial cross-lingual performance disparities, (ii) low input–output language consistency that may correlate with accuracy, and (iii) language-dependent chain-of-thought lengths; we further show that explicit output-language control can improve performance in some low-resource languages. Results: State-of-the-art models, including DeepSeek-R1-671B and Qwen-QwQ-32B, score only 43.4 and 41.8 overall and achieve less than 30% accuracy on PolyMath's hardest tier. PolyMath effectively discriminates model capabilities and exposes core bottlenecks in multilingual mathematical reasoning, providing a reproducible, fine-grained evaluation framework for future research.

📝 Abstract
In this paper, we introduce PolyMath, a multilingual mathematical reasoning benchmark covering 18 languages and 4 easy-to-hard difficulty levels. Our benchmark ensures difficulty comprehensiveness, language diversity, and high-quality translation, making it a highly discriminative multilingual mathematical benchmark in the era of reasoning LLMs. We conduct a comprehensive evaluation of advanced LLMs and find that even DeepSeek-R1-671B and Qwen-QwQ-32B achieve benchmark scores of only 43.4 and 41.8, with less than 30% accuracy at the highest difficulty level. From a language perspective, our benchmark reveals several key challenges of LLMs in multilingual reasoning: (1) reasoning performance varies widely across languages for current LLMs; (2) input-output language consistency is low in reasoning LLMs and may be correlated with performance; (3) thinking length differs significantly by language for current LLMs. Additionally, we demonstrate that controlling the output language in the instructions has the potential to affect reasoning performance, especially for some low-resource languages, suggesting a promising direction for improving multilingual capabilities in LLMs.
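To make the last point concrete, here is a minimal Python sketch of what explicit output-language control could look like at the prompt level. The instruction wording, the `LANGUAGE_NAMES` table, and the `build_prompt` helper are illustrative assumptions, not the paper's actual templates.

```python
# Minimal sketch of explicit output-language control at the prompt level.
# The instruction wording and LANGUAGE_NAMES table are illustrative
# assumptions, not PolyMath's actual templates.

LANGUAGE_NAMES = {"sw": "Swahili", "te": "Telugu", "th": "Thai", "en": "English"}

def build_prompt(problem: str, output_lang: str) -> str:
    """Wrap a math problem with an explicit output-language instruction."""
    language = LANGUAGE_NAMES[output_lang]
    return (
        f"{problem}\n\n"
        f"Please reason step by step and write your entire response in {language}."
    )

# Example: a Swahili problem statement with the output pinned to English,
# one controlled condition for probing low-resource-language behavior.
print(build_prompt("Tatua: 3x + 5 = 20. x ni ngapi?", "en"))
```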
Problem

Research questions and friction points this paper is trying to address.

Rigorously evaluating multilingual mathematical reasoning across 18 languages
Building a benchmark that spans graded easy-to-hard difficulty levels and diverse languages for reasoning LLMs
Characterizing cross-lingual performance variation and input-output language consistency (a measurement sketch follows this list)
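Input-output language consistency can be quantified directly. Below is a rough sketch using the off-the-shelf `langdetect` package; the paper does not specify which detector it uses, so this choice is an assumption, and detection on short strings is noisy.

```python
# Rough sketch of measuring input-output language consistency with the
# off-the-shelf langdetect package (pip install langdetect). The paper's
# actual detector is unspecified; short strings detect unreliably.
from langdetect import detect

def consistency_rate(pairs: list[tuple[str, str]]) -> float:
    """Fraction of (input, output) pairs whose detected languages match."""
    matches = sum(detect(inp) == detect(out) for inp, out in pairs)
    return matches / len(pairs)

pairs = [
    # French input, French output: consistent.
    ("Résolvez l'équation : deux x égale dix.",
     "La solution de l'équation est x égale cinq."),
    # Spanish input, English output: a language switch.
    ("Resuelve la ecuación: dos x es igual a diez.",
     "The solution of the equation is x equals five."),
]
print(f"consistency: {consistency_rate(pairs):.0%}")
```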
Innovation

Methods, ideas, or system contributions that make the work stand out.

PolyMath, a multilingual benchmark covering 18 languages with high-quality translations
Four easy-to-hard difficulty levels for fine-grained, discriminative evaluation (a scoring sketch follows this list)
Evidence that controlling the output language in instructions can improve reasoning performance, especially for low-resource languages
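The overall benchmark scores quoted above (43.4 and 41.8) aggregate per-tier accuracy into a single number. Below is a hypothetical sketch of a difficulty-weighted aggregation; the level names, accuracies, and weights are illustrative assumptions, since the paper's exact weighting scheme is not reproduced here.

```python
# Hypothetical sketch of rolling per-difficulty-level accuracy into one
# benchmark score. Level names, accuracies, and weights are illustrative
# assumptions; the paper's exact aggregation scheme may differ.

def benchmark_score(acc_by_level: dict[str, float],
                    weights: dict[str, int]) -> float:
    """Difficulty-weighted average of per-level accuracies (in percent)."""
    total = sum(weights.values())
    return sum(acc_by_level[level] * w for level, w in weights.items()) / total

# Made-up accuracies for a single model; harder tiers count for more.
acc = {"low": 85.0, "medium": 60.0, "high": 40.0, "top": 25.0}
weights = {"low": 1, "medium": 2, "high": 4, "top": 8}
print(f"benchmark score: {benchmark_score(acc, weights):.1f}")
```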