🤖 AI Summary
Existing large reasoning models (LRMs) face two critical bottlenecks in multilingual settings: input–output language inconsistency and high error rates in non-English reasoning paths, leading to low answer accuracy. To address these issues, we propose M-Thinker, a framework that integrates explicit language-consistency constraints and cross-lingual reasoning-path alignment into a reinforcement learning (RL) paradigm, optimized iteratively with the Group Relative Policy Optimization (GRPO) algorithm. The key contribution is a pair of reward signals, a Language Consistency (LC) reward and a Cross-lingual Thinking Alignment (CTA) reward, that enable efficient transfer of the model's strong English reasoning capability to non-English languages. Experiments on the MMATH and PolyMath benchmarks demonstrate that M-Thinker-1.5B/7B achieves nearly 100% language consistency, substantially improves non-English reasoning accuracy, and generalizes robustly across languages.
📝 Abstract
Large Reasoning Models (LRMs) have achieved remarkable performance on complex reasoning tasks by adopting the "think-then-answer" paradigm, which enhances both accuracy and interpretability. However, current LRMs exhibit two critical limitations when processing non-English languages: (1) they often fail to maintain input-output language consistency; (2) they produce more erroneous reasoning paths and achieve lower answer accuracy than in English. These limitations significantly degrade the user experience for non-English speakers and hinder the global deployment of LRMs. To address them, we propose M-Thinker, which is trained with the GRPO algorithm using a Language Consistency (LC) reward and a novel Cross-lingual Thinking Alignment (CTA) reward. Specifically, the LC reward imposes a strict constraint on language consistency between the input, the thought, and the answer. The CTA reward compares the model's non-English reasoning paths with its English reasoning path, transferring the model's own reasoning capability from English to non-English languages. Through an iterative RL procedure, our M-Thinker-1.5B/7B models not only achieve nearly 100% language consistency and superior performance on two multilingual benchmarks (MMATH and PolyMath), but also exhibit excellent generalization to out-of-domain languages.
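The dual-reward design described above can be sketched in a few lines. The sketch below is an illustrative assumption, not the paper's implementation: `detect_language` is a toy script-based identifier, `cta_reward` takes an arbitrary cross-lingual similarity scorer (e.g. an embedding cosine), and the gating in `total_reward` assumes the LC reward acts as a strict multiplicative constraint.

```python
def detect_language(text: str) -> str:
    """Toy language identifier (illustrative only): CJK script vs. everything else."""
    if any("\u4e00" <= ch <= "\u9fff" for ch in text):
        return "zh"
    return "en"

def lc_reward(input_text: str, thought: str, answer: str) -> float:
    """Language Consistency: 1.0 only if both the thought and the answer
    are in the same language as the input, else 0.0."""
    lang = detect_language(input_text)
    same = detect_language(thought) == lang and detect_language(answer) == lang
    return 1.0 if same else 0.0

def cta_reward(non_en_thought: str, en_thought: str, similarity) -> float:
    """Cross-lingual Thinking Alignment: score the non-English reasoning path
    against the model's own English reasoning path using a supplied
    cross-lingual similarity function (assumed, e.g. an embedding cosine)."""
    return similarity(non_en_thought, en_thought)

def total_reward(correctness: float, lc: float, cta: float, alpha: float = 0.5) -> float:
    """Combine the rewards; weighting and strict LC gating are assumptions."""
    # Language inconsistency zeroes the reward, mirroring the "strict constraint".
    return (correctness + alpha * cta) * lc
```

For example, a Chinese query whose thought and answer are both in Chinese earns `lc_reward = 1.0`; switching the thought to English drops it to 0.0, which in turn zeroes the total reward regardless of answer correctness.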