VEPO: Variable Entropy Policy Optimization for Low-Resource Language Foundation Models

📅 2026-03-19
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the limited performance of large language models on low-resource languages, primarily caused by inefficient subword segmentation and imbalanced training data. To tackle these issues, the authors propose the Variable Entropy Policy Optimization (VEPO) framework, which integrates a variable entropy mechanism and asymmetric clipping into reinforcement learning to impose deterministic structural constraints. VEPO further incorporates a verifiable reward function that dynamically balances literal fidelity and semantic fluency, effectively preventing policy collapse, sustaining robust exploration, and ensuring output sequences adhere to required length, format, and linguistic norms. Experimental results across 90 FLORES-200 translation directions, evaluated with the COMET-22 and chrF metrics, demonstrate that VEPO significantly improves tokenization efficiency and translation quality, substantially narrowing the performance gap for low-resource languages.
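A verifiable reward of the kind described above can be sketched as follows. This is a toy illustration only: the hard structural checks (length cap, spacing as a format proxy) and the fidelity/fluency proxies are placeholder assumptions, not the paper's actual reward design.

```python
import re

def verifiable_reward(output: str, reference: str,
                      max_len: int = 256,
                      fidelity_weight: float = 0.5) -> float:
    """Toy verifiable reward: deterministic structural checks gate a
    soft score blending literal fidelity with a fluency proxy.
    All checks and weights are illustrative, not from the paper."""
    out_toks = output.split()
    # Deterministic structural constraints: any failed check zeroes the reward.
    if len(out_toks) == 0 or len(out_toks) > max_len:
        return 0.0
    if re.search(r"\s{2,}", output):  # malformed spacing as a format proxy
        return 0.0

    # Literal-fidelity proxy: token overlap with the reference.
    ref_toks = reference.split()
    fidelity = len(set(out_toks) & set(ref_toks)) / max(len(set(ref_toks)), 1)

    # Fluency proxy: output length close to the reference length.
    ratio = len(out_toks) / max(len(ref_toks), 1)
    fluency = max(0.0, 1.0 - abs(1.0 - ratio))

    # Dynamic balance between fidelity and fluency.
    return fidelity_weight * fidelity + (1.0 - fidelity_weight) * fluency
```

Because every component is computed from deterministic checks rather than a learned reward model, the reward remains verifiable: the same output always receives the same score.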

📝 Abstract
Large language models frequently exhibit suboptimal performance on low-resource languages, primarily due to inefficient subword segmentation and systemic training data imbalances. In this paper, we propose Variable Entropy Policy Optimization (VEPO), which leverages Reinforcement Learning with Verifiable Rewards to incorporate deterministic structural constraints into the policy alignment process. This framework enforces prescribed sequence length, robust format consistency, and rigorous linguistic well-formedness during training. Central to our approach is a variable entropy mechanism that enables the model to dynamically calibrate the equilibrium between literal fidelity and semantic naturalness by modulating the exploration-exploitation trade-off. By integrating entropy-tempered advantage estimation with asymmetric clipping, VEPO sustains robust exploration while mitigating policy collapse. Empirical evaluations across 90 FLORES-200 translation directions, measured with COMET-22 and chrF, demonstrate that VEPO yields substantial improvements in both tokenization efficiency and translation quality, bridging the performance gap for underrepresented languages.
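The combination of entropy-tempered advantages and asymmetric clipping might look roughly like the sketch below. The exact tempering form, clip bounds, and coefficient `tau` are assumptions for illustration; the page does not specify VEPO's actual formulation.

```python
import numpy as np

def vepo_surrogate(ratios, advantages, entropies,
                   clip_low=0.2, clip_high=0.28, tau=0.05):
    """Illustrative VEPO-style surrogate objective (assumed form).
    Advantages are scaled by per-token policy entropy ('tempered'),
    and the PPO-style clip range is asymmetric: a wider upper bound
    preserves exploration while a tighter lower bound guards against
    policy collapse."""
    ratios = np.asarray(ratios, dtype=float)
    advantages = np.asarray(advantages, dtype=float)
    entropies = np.asarray(entropies, dtype=float)

    # Entropy-tempered advantages: high-entropy (exploratory) tokens
    # receive a mild boost; low-entropy tokens are left unscaled.
    tempered = advantages * (1.0 + tau * entropies)

    # Asymmetric clipping of the importance ratio.
    clipped = np.clip(ratios, 1.0 - clip_low, 1.0 + clip_high)

    # Standard pessimistic surrogate: keep the smaller of the two terms.
    return np.minimum(ratios * tempered, clipped * tempered).mean()
```

With symmetric bounds this reduces to the usual PPO clipped objective; the asymmetry simply lets updates move further upward (toward exploration) than downward.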
Problem

Research questions and friction points this paper is trying to address.

low-resource languages
subword segmentation
training data imbalance
language foundation models
translation quality
Innovation

Methods, ideas, or system contributions that make the work stand out.

Variable Entropy Policy Optimization
Reinforcement Learning with Verifiable Rewards
Structural Constraints
Entropy-Tempered Advantage Estimation
Low-Resource Language Modeling
Chonghan Liu (Qiyuan Tech)
Yimin Du
Qi An
Xin He
Cunqi Zhai
Fei Tan (Associate Professor, East China Normal University; NLP, Data Mining, Network Science)
Weijia Lin
Xiaochun Gong
Yongchao Deng
Shousheng Jia (360; LLM, NLP, Deep Retrieval)
Xiangzheng Zhang (360; AI Safety, Large Language Models, Information Retrieval)