🤖 AI Summary
This work investigates whether test-time scaling (TTS) generalizes across languages for mathematical reasoning. To this end, we introduce MCLM, the first competition-level multilingual math benchmark, covering 55 languages. Using Qwen2.5-1.5B Math and our newly trained multilingual reasoning model MR1-1.5B, we systematically evaluate key TTS methods: Outcome Reward Modeling (ORM), Process Reward Modeling (PRM), and Budget Forcing (BF). Results reveal a stark disparity: BF yields a +20-point gain on English AIME but only a +1.94-point average gain across the remaining 54 languages, exposing severe cross-lingual degradation and challenging the assumed universality of TTS. Our contributions include (1) the open-sourced MCLM benchmark, (2) the multilingual reasoning model MR1-1.5B, and (3) comprehensive, reproducible evaluation results, together enabling fair, empirically grounded assessment of multilingual reasoning capabilities.
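Budget Forcing, one of the TTS methods evaluated above, caps or extends a model's reasoning at decode time. A minimal sketch is below; the `decode_step` stub and the `" Wait,"` continuation cue are illustrative assumptions, not the paper's actual implementation.

```python
# Sketch of Budget Forcing (BF): decode up to a fixed token budget, and if the
# model tries to end its reasoning early, append a continuation cue to force
# more thinking. decode_step is a placeholder for a real model's next-token call.

def decode_step(context: str) -> str:
    """Stub: return the next token (placeholder logic, not a real model)."""
    return "</think>" if context.endswith("step") else "step"

def budget_forced_decode(prompt: str, budget: int, cue: str = " Wait,") -> str:
    """Decode until `budget` tokens are consumed, suppressing early stops."""
    context = prompt
    tokens_used = 0
    while tokens_used < budget:
        tok = decode_step(context)
        if tok == "</think>":
            context += cue  # model tried to stop early: force more reasoning
        else:
            context += tok
        tokens_used += 1
    return context
```

The key design point is that the token budget, not the model's own stop signal, determines when reasoning ends, which is how BF trades extra inference compute for (hoped-for) accuracy.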
📝 Abstract
Scaling pre-training compute has proven effective for achieving multilinguality, but does the same hold for test-time scaling? In this work, we introduce MCLM, a multilingual math benchmark featuring competition-level problems in 55 languages. We test three test-time scaling methods, Outcome Reward Modeling (ORM), Process Reward Modeling (PRM), and Budget Forcing (BF), on both Qwen2.5-1.5B Math and MR1-1.5B, a multilingual LLM we trained for extended reasoning. Our experiments show that Qwen2.5-1.5B Math with ORM achieves a score of 35.8 on MCLM, while BF on MR1-1.5B attains 35.2. Although "thinking LLMs" have recently garnered significant attention, we find that their performance is comparable to traditional scaling methods such as best-of-N once constrained to similar levels of inference FLOPs. Moreover, while BF yields a 20-point improvement on English AIME, it provides only a 1.94-point average gain across the other languages, a pattern consistent across the other test-time scaling methods we studied, highlighting that test-time scaling may not generalize as effectively to multilingual tasks. To foster further research, we release MCLM, MR1-1.5B, and our evaluation results.
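The ORM setup described above amounts to best-of-N selection: sample several full solutions, then keep the one the outcome reward model scores highest. A minimal sketch follows; `generate_candidates` and `orm_score` are stand-in stubs, not the paper's models (which would be Qwen2.5-1.5B Math as generator and a trained ORM as scorer).

```python
# Sketch of best-of-N sampling with an Outcome Reward Model (ORM):
# the ORM scores only finished solutions (unlike a PRM, which scores
# intermediate reasoning steps), and the highest-scoring candidate wins.

def generate_candidates(prompt: str, n: int) -> list[str]:
    """Stub: sample n candidate solutions from the policy model."""
    return [f"solution {i} for: {prompt}" for i in range(n)]

def orm_score(prompt: str, solution: str) -> float:
    """Stub: outcome reward model scoring a completed solution."""
    return float(len(solution))  # placeholder heuristic, not a real reward

def best_of_n(prompt: str, n: int = 8) -> str:
    """Return the candidate with the highest ORM score."""
    candidates = generate_candidates(prompt, n)
    return max(candidates, key=lambda s: orm_score(prompt, s))
```

Note that the inference cost grows linearly in N, which is why the abstract's FLOPs-matched comparison against "thinking LLMs" is the fair one.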