ExpLang: Improved Exploration and Exploitation in LLM Reasoning with On-Policy Thinking Language Selection

๐Ÿ“… 2026-02-25
๐Ÿ“ˆ Citations: 0
โœจ Influential: 0
๐Ÿ“„ PDF
๐Ÿค– AI Summary
This work addresses the prevailing reliance on English-centric reasoning in reinforcement learningโ€“based post-training of large language models, which overlooks the potential of multilingual reasoning and the global demand for native-language reasoning trajectories. We propose ExpLang, a novel post-training framework that, for the first time, formulates the choice of reasoning language as a policy action within reinforcement learning, dynamically selecting among multiple languages during inference. This approach expands the exploration space and leverages the unique advantages of non-English languages. ExpLang is compatible with mainstream reinforcement learning algorithms and, under identical training budgets, consistently outperforms monolingual English-trained models, demonstrating superior reasoning capabilities and higher linguistic consistency across both seen and unseen languages.

๐Ÿ“ Abstract
Current large reasoning models (LRMs) have shown strong ability on challenging tasks after reinforcement learning (RL) based post-training. However, previous work mainly focuses on English reasoning in pursuit of the strongest performance, despite the demonstrated advantages of multilingual thinking and the demand from global users for native-language thinking traces. In this paper, we propose ExpLang, a novel LLM post-training pipeline that enables on-policy thinking language selection, using multiple languages to improve exploration and exploitation during RL. The results show that our method steadily outperforms English-only training under the same training budget, while showing high thinking-language compliance for both seen and unseen languages. Analysis shows that, by treating on-policy thinking language selection as an action during RL, ExpLang effectively extends the RL exploration space through diversified language preferences and improves the RL exploitation outcome by leveraging non-English advantages. The method is orthogonal to most RL algorithms and opens a new perspective on using multilinguality to improve LRMs.
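The abstract's core idea, treating the thinking language itself as a sampled action in the RL rollout, can be illustrated with a toy sketch. This is not the paper's implementation: it replaces the LLM rollout with a bandit whose arms are languages and uses a generic REINFORCE-with-baseline update; all names here (`LANGUAGES`, `LanguageSelectionPolicy`, `mock_reward`) are hypothetical stand-ins.

```python
import math
import random

LANGUAGES = ["en", "zh", "de"]  # hypothetical pool of seen thinking languages

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

class LanguageSelectionPolicy:
    """Toy stand-in for a policy that picks its thinking language on-policy."""

    def __init__(self, lr=0.5, seed=0):
        self.logits = [0.0] * len(LANGUAGES)
        self.lr = lr
        self.rng = random.Random(seed)

    def select_language(self):
        # Sampling (rather than argmax) keeps exploration over languages alive.
        probs = softmax(self.logits)
        return self.rng.choices(range(len(LANGUAGES)), weights=probs)[0]

    def update(self, action, reward, baseline):
        # Generic REINFORCE-with-baseline step: languages whose rollouts
        # beat the running-average reward gain probability mass.
        probs = softmax(self.logits)
        advantage = reward - baseline
        for i in range(len(LANGUAGES)):
            grad = (1.0 if i == action else 0.0) - probs[i]
            self.logits[i] += self.lr * advantage * grad

def mock_reward(lang_idx):
    # Hypothetical task where thinking in "zh" yields the best outcome,
    # a stand-in for the "non-English advantage" the abstract mentions.
    return {"en": 0.5, "zh": 0.7, "de": 0.4}[LANGUAGES[lang_idx]]

policy = LanguageSelectionPolicy()
baseline = 0.0
for _ in range(500):
    action = policy.select_language()
    reward = mock_reward(action)
    policy.update(action, reward, baseline)
    baseline = 0.9 * baseline + 0.1 * reward  # running-average baseline

preferred = LANGUAGES[policy.logits.index(max(policy.logits))]
print(preferred)
```

Because the softmax never zeroes out any language, the sketch keeps exploring all of them while exploitation concentrates probability mass on the language whose trajectories earn the highest reward, which is the exploration/exploitation trade-off the abstract describes.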
Problem

Research questions and friction points this paper is trying to address.

multilingual reasoning
large reasoning models
reinforcement learning
language selection
on-policy thinking
Innovation

Methods, ideas, or system contributions that make the work stand out.

on-policy thinking language selection
multilingual reasoning
reinforcement learning
large reasoning models
language exploration and exploitation
๐Ÿ”Ž Similar Papers
No similar papers found.
Changjiang Gao
PhD student, Nanjing University
Natural Language Processing
Zixian Huang
Shanghai AI Lab
Question Answering, Natural Language Processing
Kaichen Yang
Shanghai Artificial Intelligence Laboratory, Shanghai, China; School of Mathematical Sciences, Dalian University of Technology, Liaoning, China
Jiajun Chen
National Key Laboratory for Novel Software Technology, Nanjing University, Jiangsu, China
Jixing Li
Department of Linguistics and Translation, City University of Hong Kong, Hong Kong, China
Shujian Huang
School of Computer Science, Nanjing University
Natural Language Processing, Machine Translation, Multilingualism, Large Language Models