Learning to Route Languages for Multilingual Policy Optimization

📅 2026-05-24
🤖 AI Summary
Existing policy optimization methods for large language models often confine each training question to a single response language or a fixed dominant language, which limits how much cross-lingual knowledge from multilingual corpora they can exploit. This work proposes Language-Routed Policy Optimization (LRPO), a framework that treats the response language as a selectable action within reinforcement learning. For each input, LRPO generates responses in multiple languages and integrates their relative quality into preference-based policy updates. A trainable language router, formulated as a multi-armed bandit, balances exploration of underutilized languages against exploitation of more informative ones, increasing the diversity and informativeness of training signals. Experiments show that LRPO consistently improves performance on multilingual tasks and supports cross-lingual knowledge transfer.
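To make "relative quality" across multilingual rollouts concrete, the sketch below scores one question's rollouts against each other by centering and scaling their rewards within the group. This is only an illustration: the summary does not state how LRPO computes relative quality, so the z-score normalization, the function name `relative_advantages`, the language codes, and the reward values are all assumptions.

```python
from statistics import mean, pstdev


def relative_advantages(rewards, eps=1e-8):
    """Center and scale per-language rewards within one question's rollout group.

    Assumption: a group-wise z-score stands in for the paper's (unspecified)
    way of turning relative quality into a training signal.
    """
    values = list(rewards.values())
    mu = mean(values)
    sigma = pstdev(values)  # population std over the rollout group
    return {lang: (r - mu) / (sigma + eps) for lang, r in rewards.items()}


# Hypothetical rewards for one question answered in four languages.
rewards = {"en": 1.0, "de": 0.0, "zh": 1.0, "sw": 0.0}
adv = relative_advantages(rewards)
```

Rollouts that beat the group average receive a positive value and the rest a negative one, so a group in which different languages succeed and fail carries a usable contrast even when a single-language group would not.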
📝 Abstract
Large language models (LLMs) are trained on heterogeneous multilingual corpora, yet existing policy optimization methods often implicitly restrict each training question to a single response language or rely on a fixed dominant language for supervision. We propose language-routed policy optimization (LRPO), an online policy optimization framework that treats language as a selectable variable. LRPO elicits multilingual rollouts for each training question and integrates their relative quality into preference-based policy updates, increasing the diversity and informativeness of training signals under a fixed rollout budget. To adaptively determine which languages to explore during reinforcement learning, we introduce a trainable language router formulated as a multi-armed bandit, balancing exploration of underutilized languages with exploitation of more informative ones. Extensive experiments show that LRPO consistently improves multilingual performance, demonstrating that adaptive language routing enables effective cross-lingual knowledge exploitation during training. We release all resources at https://github.com/Guochry/LRPO.
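The exploration and exploitation trade-off of a bandit-style language router can be sketched as follows. The abstract does not say which bandit algorithm or reward signal LRPO uses, so this example substitutes the standard UCB1 rule as a stand-in; the class `LanguageRouter`, the parameter `c`, the language codes, and the simulated signal values are hypothetical and are not taken from the released code.

```python
import math
import random


class LanguageRouter:
    """UCB1 bandit over candidate response languages (illustrative stand-in)."""

    def __init__(self, languages, c=1.0):
        self.languages = list(languages)
        self.c = c  # exploration weight (assumed hyperparameter)
        self.counts = {lang: 0 for lang in self.languages}
        self.values = {lang: 0.0 for lang in self.languages}

    def select(self, k):
        """Return the k languages with the highest upper confidence bounds."""
        total = sum(self.counts.values())

        def score(lang):
            n = self.counts[lang]
            if n == 0:
                return math.inf  # untried languages are explored first
            return self.values[lang] + self.c * math.sqrt(math.log(total) / n)

        return sorted(self.languages, key=score, reverse=True)[:k]

    def update(self, lang, signal):
        """Fold one observed training signal into the running mean for lang."""
        self.counts[lang] += 1
        self.values[lang] += (signal - self.values[lang]) / self.counts[lang]


# Simulation with made-up per-language signal levels, two rollouts per step.
random.seed(0)
router = LanguageRouter(["en", "de", "zh", "sw"])
true_signal = {"en": 0.2, "de": 0.5, "zh": 0.8, "sw": 0.4}
for _ in range(500):
    for lang in router.select(k=2):
        router.update(lang, true_signal[lang] + random.gauss(0, 0.1))
```

Selecting `k` languages per step mirrors the fixed rollout budget: the router spends most rollouts on languages whose signal has been high so far, while the confidence bonus keeps rarely chosen languages from being dropped entirely.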
Problem

Research questions and friction points this paper is trying to address.

multilingual policy optimization
language routing
large language models
preference-based learning
cross-lingual knowledge
Innovation

Methods, ideas, or system contributions that make the work stand out.

language-routed policy optimization
multilingual reinforcement learning
trainable language router
preference-based policy update
multi-armed bandit