🤖 AI Summary
Existing large reasoning models (LRMs) face two critical bottlenecks in multilingual settings: input–output language inconsistency and high error rates in non-English reasoning paths, leading to low answer accuracy. To address these issues, we propose M-Thinker, a framework that integrates explicit language-consistency constraints and cross-lingual reasoning-path alignment into a reinforcement learning (RL) paradigm, optimized iteratively with the Group Relative Policy Optimization (GRPO) algorithm. The key contribution is a pair of reward signals, a Language Consistency (LC) reward and a Cross-lingual Thinking Alignment (CTA) reward, that enable efficient transfer of the model's strong English reasoning capability to non-English languages. Experiments on the MMATH and PolyMath benchmarks demonstrate that M-Thinker-1.5B/7B achieves nearly 100% language consistency, substantially improves non-English reasoning accuracy, and generalizes robustly across languages.
📝 Abstract
Large Reasoning Models (LRMs) have achieved remarkable performance on complex reasoning tasks by adopting the "think-then-answer" paradigm, which enhances both accuracy and interpretability. However, current LRMs exhibit two critical limitations when processing non-English languages: (1) they often fail to maintain input-output language consistency; (2) they produce more erroneous reasoning paths and achieve lower answer accuracy than in English. These limitations significantly degrade the user experience for non-English speakers and hinder the global deployment of LRMs. To address them, we propose M-Thinker, which is trained with the GRPO algorithm using a Language Consistency (LC) reward and a novel Cross-lingual Thinking Alignment (CTA) reward. Specifically, the LC reward imposes a strict constraint on language consistency between the input, the thought, and the answer. The CTA reward compares the model's non-English reasoning paths with its English reasoning path, transferring the model's own reasoning capability from English to non-English languages. Through an iterative RL procedure, our M-Thinker-1.5B/7B models not only achieve nearly 100% language consistency and superior performance on two multilingual benchmarks (MMATH and PolyMath), but also exhibit excellent generalization to out-of-domain languages.
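The dual-reward design described above can be sketched in a few lines. The sketch below is an illustrative assumption, not the paper's implementation: `detect_language` is a toy script-based identifier, `cta_reward` takes an arbitrary cross-lingual similarity scorer (e.g. an embedding cosine), and the gating in `total_reward` assumes the LC reward acts as a strict multiplicative constraint.

```python
def detect_language(text: str) -> str:
    """Toy language identifier (illustrative only): CJK script vs. everything else."""
    if any("\u4e00" <= ch <= "\u9fff" for ch in text):
        return "zh"
    return "en"

def lc_reward(input_text: str, thought: str, answer: str) -> float:
    """Language Consistency: 1.0 only if both the thought and the answer
    are in the same language as the input, else 0.0."""
    lang = detect_language(input_text)
    same = detect_language(thought) == lang and detect_language(answer) == lang
    return 1.0 if same else 0.0

def cta_reward(non_en_thought: str, en_thought: str, similarity) -> float:
    """Cross-lingual Thinking Alignment: score the non-English reasoning path
    against the model's own English reasoning path using a supplied
    cross-lingual similarity function (assumed, e.g. an embedding cosine)."""
    return similarity(non_en_thought, en_thought)

def total_reward(correctness: float, lc: float, cta: float, alpha: float = 0.5) -> float:
    """Combine the rewards; weighting and strict LC gating are assumptions."""
    # Language inconsistency zeroes the reward, mirroring the "strict constraint".
    return (correctness + alpha * cta) * lc
```

For example, a Chinese query whose thought and answer are both in Chinese earns `lc_reward = 1.0`; switching the thought to English drops it to 0.0, which in turn zeroes the total reward regardless of answer correctness.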