Linguistic Generalizability of Test-Time Scaling in Mathematical Reasoning

📅 2025-02-24
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work investigates the cross-lingual generalization of test-time scaling (TTS) for multilingual mathematical reasoning. To this end, we introduce MCLM, the first competition-level benchmark covering 55 languages. Using Qwen2.5-1.5B Math and our newly developed multilingual model MR1-1.5B, we systematically evaluate key TTS methods—including Outcome/Process Reward Modeling (ORM/PRM) and Budget Forcing (BF). Results reveal a stark performance disparity: BF yields a +20-point gain on English AIME but only +1.94 points on average across the remaining 54 languages, exposing severe cross-lingual degradation and challenging the assumed universality of TTS. Our contributions include (1) the open-sourced MCLM benchmark, (2) the MR1-1.5B multilingual foundation model, and (3) comprehensive, reproducible evaluation results—collectively establishing a new standard for fair, empirically grounded assessment of multilingual reasoning capabilities.

📝 Abstract
Scaling pre-training compute has proven effective for achieving multilinguality, but does the same hold for test-time scaling? In this work, we introduce MCLM, a multilingual math benchmark featuring competition-level problems in 55 languages. We test three test-time scaling methods, Outcome Reward Modeling (ORM), Process Reward Modeling (PRM), and Budget Forcing (BF), on both Qwen2.5-1.5B Math and MR1-1.5B, a multilingual LLM we trained for extended reasoning. Our experiments show that using Qwen2.5-1.5B Math with ORM achieves a score of 35.8 on MCLM, while BF on MR1-1.5B attains 35.2. Although "thinking LLMs" have recently garnered significant attention, we find that their performance is comparable to traditional scaling methods like best-of-N once constrained to similar levels of inference FLOPs. Moreover, while BF yields a 20-point improvement on English AIME, it provides only a 1.94-point average gain across other languages, a pattern consistent across the other test-time scaling methods we studied, highlighting that test-time scaling may not generalize as effectively to multilingual tasks. To foster further research, we release MCLM, MR1-1.5B, and evaluation results.
Problem

Research questions and friction points this paper is trying to address.

Evaluates test-time scaling in multilingual math reasoning.
Compares ORM, PRM, and BF methods across languages.
Assesses generalizability of scaling techniques in multilingual tasks.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multilingual math benchmark MCLM
Test-time scaling methods ORM, PRM, BF
Multilingual LLM MR1-1.5B for reasoning
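The ORM/PRM methods above rerank sampled solutions with a reward model, i.e. best-of-N selection. A minimal sketch of that selection loop, where `generate` and `score` are hypothetical stand-ins for the paper's actual sampler and reward model:

```python
# Minimal sketch of best-of-N selection with an outcome reward model (ORM).
# `generate` and `score` are hypothetical stand-ins, not the paper's components.

def best_of_n(generate, score, prompt, n=8):
    """Sample n candidate solutions and return the one the reward model prefers."""
    candidates = [generate(prompt) for _ in range(n)]
    return max(candidates, key=score)

# Toy usage: the "reward" prefers answers closest to 42.
answers = iter(["42", "41", "43"])
pick = best_of_n(lambda p: next(answers), lambda a: -abs(int(a) - 42), "2*21?", n=3)
print(pick)  # prints 42
```

A process reward model (PRM) differs only in that `score` aggregates per-step scores over the reasoning chain rather than judging the final outcome.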