Think Natively: Unlocking Multilingual Reasoning with Consistency-Enhanced Reinforcement Learning

📅 2025-10-08
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing large reasoning models (LRMs) face two critical bottlenecks in multilingual settings: input–output language inconsistency and error-prone non-English reasoning paths that lower answer accuracy. To address these issues, the authors propose M-Thinker, a framework that integrates explicit language-consistency constraints and cross-lingual reasoning alignment into a reinforcement learning (RL) paradigm, trained iteratively with the Group Relative Policy Optimization (GRPO) algorithm. The key contribution is a pair of reward signals, a Language Consistency (LC) reward and a Cross-lingual Thinking Alignment (CTA) reward, which transfer the model's strong English reasoning capability to non-English languages. Experiments on the MMATH and PolyMath benchmarks show that M-Thinker-1.5B/7B achieves nearly 100% language consistency, substantially improves non-English reasoning accuracy, and generalizes robustly to out-of-domain languages.

📝 Abstract
Large Reasoning Models (LRMs) have achieved remarkable performance on complex reasoning tasks by adopting the "think-then-answer" paradigm, which enhances both accuracy and interpretability. However, current LRMs exhibit two critical limitations when processing non-English languages: (1) they often struggle to maintain input-output language consistency; (2) they generally perform worse than in English, with more flawed reasoning paths and lower answer accuracy. These limitations significantly degrade the user experience for non-English speakers and hinder the global deployment of LRMs. To address these limitations, we propose M-Thinker, which is trained with the GRPO algorithm using a Language Consistency (LC) reward and a novel Cross-lingual Thinking Alignment (CTA) reward. Specifically, the LC reward imposes a strict constraint on language consistency between the input, the thought, and the answer. In addition, the CTA reward compares the model's non-English reasoning paths with its English reasoning path, transferring its own reasoning capability from English to non-English languages. Through an iterative RL procedure, our M-Thinker-1.5B/7B models not only achieve nearly 100% language consistency and superior performance on two multilingual benchmarks (MMATH and PolyMath), but also exhibit excellent generalization to out-of-domain languages.
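As background on the GRPO algorithm named in the abstract: GRPO scores each sampled response relative to a group of responses drawn for the same prompt, normalizing rewards by the group's mean and standard deviation. A minimal sketch of that group-relative advantage computation (the epsilon term is our assumption for numerical stability, not a detail from the paper):

```python
# Group-relative advantages as used in GRPO-style training:
# each response's reward is centered and scaled by its group's statistics.
from statistics import mean, pstdev

def group_relative_advantages(rewards):
    """Normalize a group of scalar rewards to zero mean, unit scale."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + 1e-8) for r in rewards]

# Four sampled responses to one prompt: two rewarded, two not.
print(group_relative_advantages([1.0, 0.0, 1.0, 0.0]))
```

Because advantages are computed within each prompt's group, no separate value network is needed; responses are pushed toward whatever their siblings did better.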
Problem

Research questions and friction points this paper is trying to address.

Enhancing multilingual reasoning consistency in Large Reasoning Models
Improving non-English reasoning path quality and accuracy
Bridging performance gap between English and non-English languages
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses GRPO algorithm with language consistency reward
Introduces cross-lingual thinking alignment reward mechanism
Transfers English reasoning capability to multilingual contexts
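The reward design described above could be combined into a single scalar for RL training roughly as follows. Everything in this sketch, including the function names, the weights, and the stubbed exact-match helpers, is an illustrative assumption rather than the paper's actual formulation:

```python
# Hypothetical combination of correctness, LC, and CTA reward signals
# into one scalar reward per sampled response.

def language_consistency_reward(input_lang, thought_lang, answer_lang):
    """LC reward: 1.0 only when input, thought, and answer share one language."""
    return 1.0 if input_lang == thought_lang == answer_lang else 0.0

def cross_lingual_alignment_reward(non_en_answer, en_answer):
    """CTA reward (stub): reward agreement between the non-English answer
    and the answer the model derived via its own English reasoning path."""
    return 1.0 if non_en_answer.strip() == en_answer.strip() else 0.0

def total_reward(correct, input_lang, thought_lang, answer_lang,
                 non_en_answer, en_answer, w_lc=0.5, w_cta=0.5):
    """Correctness plus weighted LC and CTA terms (weights are assumptions)."""
    lc = language_consistency_reward(input_lang, thought_lang, answer_lang)
    cta = cross_lingual_alignment_reward(non_en_answer, en_answer)
    return float(correct) + w_lc * lc + w_cta * cta

# A German query answered correctly, reasoned in German, and agreeing
# with the English reasoning path's answer earns the full reward.
print(total_reward(True, "de", "de", "de", "42", "42"))  # 2.0
```

In practice the CTA comparison would operate on full reasoning traces rather than exact-match answers, but the structure, a consistency gate plus a cross-lingual agreement bonus on top of task correctness, is the idea the bullets above describe.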
Xue Zhang
Key Laboratory of Big Data & Artificial Intelligence in Transportation, Beijing Jiaotong University, Ministry of Education; School of Computer Science and Technology, Beijing Jiaotong University, Beijing, China
Yunlong Liang
WeChat
Natural Language Processing (NLP)
Fandong Meng
WeChat AI, Tencent
Machine Translation, Natural Language Processing
Songming Zhang
Beijing Jiaotong University
Natural Language Processing, Text Generation, Machine Translation
Kaiyu Huang
Key Laboratory of Big Data & Artificial Intelligence in Transportation, Beijing Jiaotong University, Ministry of Education; School of Computer Science and Technology, Beijing Jiaotong University, Beijing, China
Yufeng Chen
Key Laboratory of Big Data & Artificial Intelligence in Transportation, Beijing Jiaotong University, Ministry of Education; School of Computer Science and Technology, Beijing Jiaotong University, Beijing, China
Jinan Xu
Professor of School of Computer and Information Technology, Beijing Jiaotong University
NLP, Machine Translation, LLM
Jie Zhou
Pattern Recognition Center, WeChat AI, Tencent Inc, China