ExpLang: Improved Exploration and Exploitation in LLM Reasoning with On-Policy Thinking Language Selection

๐Ÿ“… 2026-02-25
๐Ÿ“ˆ Citations: 0
โœจ Influential: 0
๐Ÿ“„ PDF
๐Ÿค– AI Summary
This work addresses the prevailing reliance on English-centric reasoning in reinforcement learningโ€“based post-training of large language models, which overlooks the potential of multilingual reasoning and the global demand for native-language reasoning trajectories. We propose ExpLang, a novel post-training framework that, for the first time, formulates the choice of reasoning language as a policy action within reinforcement learning, dynamically selecting among multiple languages during inference. This approach expands the exploration space and leverages the unique advantages of non-English languages. ExpLang is compatible with mainstream reinforcement learning algorithms and, under identical training budgets, consistently outperforms monolingual English-trained models, demonstrating superior reasoning capabilities and higher linguistic consistency across both seen and unseen languages.

๐Ÿ“ Abstract
Current large reasoning models (LRMs) have shown strong ability on challenging tasks after reinforcement learning (RL) based post-training. However, previous work mainly focuses on English reasoning in pursuit of the strongest performance, despite the demonstrated advantages of multilingual thinking and the demand from global users for native-language thinking traces. In this paper, we propose ExpLang, a novel LLM post-training pipeline that enables on-policy thinking language selection, using multiple languages to improve exploration and exploitation during RL. The results show that our method steadily outperforms English-only training under the same training budget, while showing high thinking-language compliance for both seen and unseen languages. Analysis shows that, by treating on-policy thinking language selection as an action during RL, ExpLang effectively extends the RL exploration space through diversified language preferences and improves the RL exploitation outcome by leveraging non-English advantages. The method is orthogonal to most RL algorithms and opens a new perspective on using multilinguality to improve LRMs.
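The abstract's core idea, treating the thinking language itself as a sampled action in the RL rollout, can be illustrated with a toy sketch. This is not the paper's implementation: it replaces the LLM rollout with a bandit whose arms are languages and uses a generic REINFORCE-with-baseline update; all names here (`LANGUAGES`, `LanguageSelectionPolicy`, `mock_reward`) are hypothetical stand-ins.

```python
import math
import random

LANGUAGES = ["en", "zh", "de"]  # hypothetical pool of seen thinking languages

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

class LanguageSelectionPolicy:
    """Toy stand-in for a policy that picks its thinking language on-policy."""

    def __init__(self, lr=0.5, seed=0):
        self.logits = [0.0] * len(LANGUAGES)
        self.lr = lr
        self.rng = random.Random(seed)

    def select_language(self):
        # Sampling (rather than argmax) keeps exploration over languages alive.
        probs = softmax(self.logits)
        return self.rng.choices(range(len(LANGUAGES)), weights=probs)[0]

    def update(self, action, reward, baseline):
        # Generic REINFORCE-with-baseline step: languages whose rollouts
        # beat the running-average reward gain probability mass.
        probs = softmax(self.logits)
        advantage = reward - baseline
        for i in range(len(LANGUAGES)):
            grad = (1.0 if i == action else 0.0) - probs[i]
            self.logits[i] += self.lr * advantage * grad

def mock_reward(lang_idx):
    # Hypothetical task where thinking in "zh" yields the best outcome,
    # a stand-in for the "non-English advantage" the abstract mentions.
    return {"en": 0.5, "zh": 0.7, "de": 0.4}[LANGUAGES[lang_idx]]

policy = LanguageSelectionPolicy()
baseline = 0.0
for _ in range(500):
    action = policy.select_language()
    reward = mock_reward(action)
    policy.update(action, reward, baseline)
    baseline = 0.9 * baseline + 0.1 * reward  # running-average baseline

preferred = LANGUAGES[policy.logits.index(max(policy.logits))]
print(preferred)
```

Because the softmax never zeroes out any language, the sketch keeps exploring all of them while exploitation concentrates probability mass on the language whose trajectories earn the highest reward, which is the exploration/exploitation trade-off the abstract describes.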
Problem

Research questions and friction points this paper is trying to address.

multilingual reasoning
large reasoning models
reinforcement learning
language selection
on-policy thinking
Innovation

Methods, ideas, or system contributions that make the work stand out.

on-policy thinking language selection
multilingual reasoning
reinforcement learning
large reasoning models
language exploration and exploitation
๐Ÿ”Ž Similar Papers
No similar papers found.
Changjiang Gao
PhD student, Nanjing University
Natural Language Processing
Zixian Huang
Shanghai AI Lab
Question Answering, Natural Language Processing
Kaichen Yang
Shanghai Artificial Intelligence Laboratory, Shanghai, China; School of Mathematical Sciences, Dalian University of Technology, Liaoning, China
Jiajun Chen
National Key Laboratory for Novel Software Technology, Nanjing University, Jiangsu, China
Jixing Li
Department of Linguistics and Translation, City University of Hong Kong, Hong Kong, China
Shujian Huang
School of Computer Science, Nanjing University
Natural Language Processing, Machine Translation, Multilingualism, Large Language Models