🤖 AI Summary
Existing policy optimization methods for large language models are often confined to a single response language or a fixed dominant language, which limits their ability to leverage cross-lingual knowledge from multilingual corpora. This work proposes Language-Routed Policy Optimization (LRPO), a framework that treats the response language as a selectable choice within reinforcement learning. For each input, LRPO generates responses in multiple languages and integrates their relative quality into preference-based policy updates. A trainable language router, formulated as a multi-armed bandit, balances exploration of underutilized languages against exploitation of more informative ones, increasing the diversity and informativeness of training signals. Experimental results show that LRPO consistently improves model performance on multilingual tasks and facilitates cross-lingual knowledge transfer.
📝 Abstract
Large language models (LLMs) are trained on heterogeneous multilingual corpora, yet existing policy optimization methods often implicitly restrict each training question to a single response language or rely on a fixed dominant language for supervision. We propose language-routed policy optimization (LRPO), an online policy optimization framework that treats language as a selectable variable. LRPO elicits multilingual rollouts for each training question and integrates their relative quality into preference-based policy updates, increasing the diversity and informativeness of training signals under a fixed rollout budget. To adaptively determine which languages to explore during reinforcement learning, we introduce a trainable language router formulated as a multi-armed bandit, balancing exploration of underutilized languages with exploitation of more informative ones. Extensive experiments show that LRPO consistently improves multilingual performance, demonstrating that adaptive language routing enables effective cross-lingual knowledge exploitation during training. We release all resources at https://github.com/Guochry/LRPO.
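To make the routing idea concrete, the following is a minimal sketch of a bandit-style language router, not the released implementation. The abstract does not specify the bandit algorithm, the reward definition, or how the router is trained, so everything here is an assumption for illustration: the class name `LanguageRouterUCB`, the use of UCB1 as the exploration rule, the exploration constant `c`, and the helper `relative_advantages`, which standardizes rewards within one question's rollout group as one plausible way to express "relative quality".

```python
import math


class LanguageRouterUCB:
    """Hypothetical UCB1 router over response languages (illustrative only)."""

    def __init__(self, languages, c=1.0):
        self.languages = list(languages)
        self.c = c  # assumed exploration strength
        self.counts = {lang: 0 for lang in self.languages}
        self.means = {lang: 0.0 for lang in self.languages}

    def select(self, k):
        """Pick k distinct languages for the next group of rollouts."""
        total = sum(self.counts.values())

        def score(lang):
            n = self.counts[lang]
            if n == 0:
                return math.inf  # untried languages are explored first
            bonus = self.c * math.sqrt(math.log(total) / n)
            return self.means[lang] + bonus  # exploitation + exploration

        return sorted(self.languages, key=score, reverse=True)[:k]

    def update(self, lang, reward):
        """Fold one observed rollout reward into the running mean."""
        self.counts[lang] += 1
        self.means[lang] += (reward - self.means[lang]) / self.counts[lang]


def relative_advantages(rewards):
    """Standardize rewards within one question's rollout group (assumed form)."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    return [(r - mean) / (math.sqrt(var) + 1e-8) for r in rewards]


if __name__ == "__main__":
    router = LanguageRouterUCB(["en", "zh", "de"])
    # Placeholder rewards standing in for a real reward model or verifier.
    fake_reward = {"en": 1.0, "zh": 0.0, "de": 0.5}
    for step in range(5):
        chosen = router.select(k=2)
        rewards = [fake_reward[lang] for lang in chosen]
        for lang, r in zip(chosen, rewards):
            router.update(lang, r)
        print(step, chosen, relative_advantages(rewards))
```

In this sketch the per-language mean reward plays the role of "informativeness", and the count-based bonus steers rollouts toward languages that have been sampled rarely. A learned router, as the abstract describes, would replace these running statistics with trainable parameters.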