VEPO: Variable Entropy Policy Optimization for Low-Resource Language Foundation Models

📅 2026-03-19
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the limited performance of large language models on low-resource languages, primarily caused by inefficient subword segmentation and imbalanced training data. To tackle these issues, the authors propose the Variable Entropy Policy Optimization (VEPO) framework, which integrates a variable entropy mechanism and asymmetric clipping into reinforcement learning to impose deterministic structural constraints. VEPO further incorporates a verifiable reward function that dynamically balances literal fidelity and semantic fluency, effectively preventing policy collapse, sustaining robust exploration, and ensuring output sequences adhere to required length, format, and linguistic norms. Experimental results across 90 FLORES-200 translation directions, evaluated with the COMET-22 and chrF metrics, demonstrate that VEPO significantly improves tokenization efficiency and translation quality, substantially narrowing the performance gap for low-resource languages.
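A verifiable reward of the kind described above can be sketched as follows. This is a toy illustration only: the hard structural checks (length cap, spacing as a format proxy) and the fidelity/fluency proxies are placeholder assumptions, not the paper's actual reward design.

```python
import re

def verifiable_reward(output: str, reference: str,
                      max_len: int = 256,
                      fidelity_weight: float = 0.5) -> float:
    """Toy verifiable reward: deterministic structural checks gate a
    soft score blending literal fidelity with a fluency proxy.
    All checks and weights are illustrative, not from the paper."""
    out_toks = output.split()
    # Deterministic structural constraints: any failed check zeroes the reward.
    if len(out_toks) == 0 or len(out_toks) > max_len:
        return 0.0
    if re.search(r"\s{2,}", output):  # malformed spacing as a format proxy
        return 0.0

    # Literal-fidelity proxy: token overlap with the reference.
    ref_toks = reference.split()
    fidelity = len(set(out_toks) & set(ref_toks)) / max(len(set(ref_toks)), 1)

    # Fluency proxy: output length close to the reference length.
    ratio = len(out_toks) / max(len(ref_toks), 1)
    fluency = max(0.0, 1.0 - abs(1.0 - ratio))

    # Dynamic balance between fidelity and fluency.
    return fidelity_weight * fidelity + (1.0 - fidelity_weight) * fluency
```

Because every component is computed from deterministic checks rather than a learned reward model, the reward remains verifiable: the same output always receives the same score.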

📝 Abstract
Large language models frequently exhibit suboptimal performance on low-resource languages, primarily due to inefficient subword segmentation and systemic training data imbalances. In this paper, we propose Variable Entropy Policy Optimization (VEPO), which leverages Reinforcement Learning with Verifiable Rewards to incorporate deterministic structural constraints into the policy alignment process. This framework enforces prescribed sequence length, robust format consistency, and rigorous linguistic well-formedness during training. Central to our approach is a variable entropy mechanism that enables the model to dynamically calibrate the equilibrium between literal fidelity and semantic naturalness by modulating the exploration-exploitation trade-off. By integrating entropy-tempered advantage estimation with asymmetric clipping, VEPO sustains robust exploration while mitigating policy collapse. Empirical evaluations across 90 FLORES-200 translation directions, measured with COMET-22 and chrF, demonstrate that VEPO yields substantial improvements in both tokenization efficiency and translation quality, bridging the performance gap for underrepresented languages.
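The combination of entropy-tempered advantages and asymmetric clipping might look roughly like the sketch below. The exact tempering form, clip bounds, and coefficient `tau` are assumptions for illustration; the page does not specify VEPO's actual formulation.

```python
import numpy as np

def vepo_surrogate(ratios, advantages, entropies,
                   clip_low=0.2, clip_high=0.28, tau=0.05):
    """Illustrative VEPO-style surrogate objective (assumed form).
    Advantages are scaled by per-token policy entropy ('tempered'),
    and the PPO-style clip range is asymmetric: a wider upper bound
    preserves exploration while a tighter lower bound guards against
    policy collapse."""
    ratios = np.asarray(ratios, dtype=float)
    advantages = np.asarray(advantages, dtype=float)
    entropies = np.asarray(entropies, dtype=float)

    # Entropy-tempered advantages: high-entropy (exploratory) tokens
    # receive a mild boost; low-entropy tokens are left unscaled.
    tempered = advantages * (1.0 + tau * entropies)

    # Asymmetric clipping of the importance ratio.
    clipped = np.clip(ratios, 1.0 - clip_low, 1.0 + clip_high)

    # Standard pessimistic surrogate: keep the smaller of the two terms.
    return np.minimum(ratios * tempered, clipped * tempered).mean()
```

With symmetric bounds this reduces to the usual PPO clipped objective; the asymmetry simply lets updates move further upward (toward exploration) than downward.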
Problem

Research questions and friction points this paper is trying to address.

low-resource languages
subword segmentation
training data imbalance
language foundation models
translation quality
Innovation

Methods, ideas, or system contributions that make the work stand out.

Variable Entropy Policy Optimization
Reinforcement Learning with Verifiable Rewards
Structural Constraints
Entropy-Tempered Advantage Estimation
Low-Resource Language Modeling
Chonghan Liu (Qiyuan Tech)
Yimin Du
Qi An
Xin He
Cunqi Zhai
Fei Tan (Associate Professor, East China Normal University; NLP, Data Mining, Network Science)
Weijia Lin
Xiaochun Gong
Yongchao Deng
Shousheng Jia (360; LLM, NLP, Deep Retrieval)
Xiangzheng Zhang (360; AI Safety, Large Language Models, Information Retrieval)