AI Summary
This work addresses the prevailing reliance on English-centric reasoning in reinforcement learning-based post-training of large language models, which overlooks both the potential of multilingual reasoning and the global demand for native-language reasoning trajectories. We propose ExpLang, a novel post-training framework that, for the first time, formulates the choice of reasoning language as a policy action within reinforcement learning, dynamically selecting among multiple languages during inference. This approach expands the exploration space and leverages the unique advantages of non-English languages. ExpLang is compatible with mainstream reinforcement learning algorithms and, under identical training budgets, consistently outperforms monolingual English-trained models, demonstrating superior reasoning capabilities and higher linguistic consistency across both seen and unseen languages.
Abstract
Current large reasoning models (LRMs) show strong capabilities on challenging tasks after reinforcement learning (RL) based post-training. However, previous work has focused mainly on English reasoning in pursuit of the strongest performance, despite the demonstrated advantages of multilingual thinking and the demand from global users for reasoning traces in their native languages. In this paper, we propose ExpLang, a novel LLM post-training pipeline that enables on-policy thinking-language selection, using multiple languages to improve exploration and exploitation during RL. The results show that our method consistently outperforms English-only training under the same training budget, while maintaining high thinking-language compliance for both seen and unseen languages. Our analysis shows that, by treating on-policy thinking-language selection as an action during RL, ExpLang effectively extends the RL exploration space through diversified language preferences and improves RL exploitation by leveraging the advantages of non-English reasoning. The method is orthogonal to most RL algorithms and opens a new perspective on using multilinguality to improve LRMs.
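To make the core idea concrete, the following is a minimal toy sketch, not the paper's implementation: it treats the thinking language as the first action of each rollout and scores rollouts with a group-relative advantage in the style of GRPO-like RL algorithms. All function names, the language list, the compliance bonus, and the per-language solve rates are illustrative assumptions.

```python
import random

# Hypothetical languages the policy may "think" in (an assumption for illustration).
LANGUAGES = ["en", "zh", "fr", "de"]

def rollout_reward(language, solved, target_language=None):
    """Toy reward: 1.0 for a correct answer, plus a small illustrative bonus
    when the trace matches the user's requested language (language compliance)."""
    reward = 1.0 if solved else 0.0
    if target_language is not None and language == target_language:
        reward += 0.2  # compliance bonus (assumed value, not from the paper)
    return reward

def group_advantages(rewards):
    """Group-relative advantage: each rollout's reward minus the group mean,
    as used by GRPO-style algorithms."""
    mean = sum(rewards) / len(rewards)
    return [r - mean for r in rewards]

def sample_group(language_probs, solve_rate_by_language, n=8, seed=0):
    """Sample n rollouts. Each rollout first picks a thinking language
    (the extra policy action), then succeeds with a language-dependent
    probability standing in for task performance in that language."""
    rng = random.Random(seed)
    rollouts = []
    for _ in range(n):
        lang = rng.choices(LANGUAGES,
                           weights=[language_probs[l] for l in LANGUAGES])[0]
        solved = rng.random() < solve_rate_by_language[lang]
        rollouts.append((lang, solved))
    return rollouts
```

Because the language choice is itself an action, rollouts within one group can differ in language, so languages whose traces earn higher reward receive positive advantage and get reinforced, which is how the exploration space is widened relative to English-only training.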