🤖 AI Summary
This work addresses the limited performance of large language models on low-resource languages, a weakness driven primarily by inefficient subword segmentation and imbalanced training data. To tackle these issues, the authors propose the Variable Entropy Policy Optimization (VEPO) framework, which integrates a variable entropy mechanism and asymmetric clipping into reinforcement learning to impose deterministic structural constraints. VEPO further incorporates a verifiable reward function that dynamically balances literal fidelity and semantic fluency, preventing policy collapse, sustaining robust exploration, and ensuring that output sequences adhere to required length, format, and linguistic norms. Experimental results across 90 FLORES-200 translation directions, evaluated with COMET-22 and chrF, demonstrate that VEPO significantly improves tokenization efficiency and translation quality, substantially narrowing the performance gap for low-resource languages.
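The verifiable reward described above can be sketched as deterministic structural checks gating a weighted blend of fidelity and fluency scores. This is a minimal illustrative sketch, not the paper's actual reward: the gate conditions, the metric choices, and the weight `alpha` are all assumptions.

```python
def verifiable_reward(fidelity: float, fluency: float,
                      length_ok: bool, format_ok: bool,
                      alpha: float = 0.5) -> float:
    """Hypothetical VEPO-style reward sketch.

    Structural constraints (length, format) act as hard, verifiable
    gates: any violation zeroes the reward. Otherwise the reward is a
    convex combination of a literal-fidelity score (e.g. chrF-like)
    and a semantic-fluency score (e.g. COMET-like), both in [0, 1].
    """
    if not (length_ok and format_ok):
        return 0.0  # deterministic gate: constraint violated
    return alpha * fidelity + (1.0 - alpha) * fluency
```

A dynamic balance between fidelity and fluency, as the summary describes, would correspond to scheduling or learning `alpha` during training rather than fixing it.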
📝 Abstract
Large language models frequently exhibit suboptimal performance on low-resource languages, primarily due to inefficient subword segmentation and systemic training-data imbalances. In this paper, we propose Variable Entropy Policy Optimization (VEPO), which leverages Reinforcement Learning with Verifiable Rewards to incorporate deterministic structural constraints into the policy alignment process. This framework enforces prescribed sequence length, consistent formatting, and rigorous linguistic well-formedness during training. Central to our approach is a variable entropy mechanism that dynamically calibrates the balance between literal fidelity and semantic naturalness by modulating the exploration-exploitation trade-off. By integrating entropy-tempered advantage estimation with asymmetric clipping, VEPO sustains robust exploration while mitigating policy collapse. Empirical evaluations across 90 FLORES-200 translation directions, measured with COMET-22 and chrF, demonstrate that VEPO yields substantial improvements in both tokenization efficiency and translation quality, bridging the performance gap for underrepresented languages.
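The combination of entropy-tempered advantage estimation and asymmetric clipping can be sketched as a PPO-style clipped surrogate. This is a hedged illustration under stated assumptions: the tempering form `(1 + tau * H)`, the hyperparameter names (`eps_low`, `eps_high`, `tau`), and the specific asymmetry (a wider upward clip bound) are our guesses at one plausible instantiation, not the paper's definitions.

```python
import numpy as np

def vepo_surrogate(logp_new, logp_old, advantages, token_entropy,
                   eps_low=0.2, eps_high=0.3, tau=0.1):
    """Hypothetical entropy-tempered, asymmetrically clipped loss.

    - Entropy tempering: each token's advantage is scaled by
      (1 + tau * H), so high-entropy (uncertain) tokens keep more
      gradient signal, sustaining exploration.
    - Asymmetric clipping: the importance ratio is clipped to
      [1 - eps_low, 1 + eps_high] with eps_high > eps_low, letting
      promising low-probability tokens grow faster while still
      bounding downward updates, which helps avoid policy collapse.
    Returns a scalar loss to be minimized.
    """
    ratio = np.exp(np.asarray(logp_new) - np.asarray(logp_old))
    adv = np.asarray(advantages) * (1.0 + tau * np.asarray(token_entropy))
    unclipped = ratio * adv
    clipped = np.clip(ratio, 1.0 - eps_low, 1.0 + eps_high) * adv
    # Pessimistic (min) surrogate, negated for gradient descent.
    return float(-np.mean(np.minimum(unclipped, clipped)))
```

With `eps_low == eps_high` and `tau == 0` this reduces to the standard symmetric PPO clipped objective; the two deviations are exactly the mechanisms the abstract names.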