Entropy-Gated Selective Policy Optimization: Token-Level Gradient Allocation for Hybrid Training of Large Language Models

📅 2026-02-03
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses a limitation of existing hybrid training methods for large language models: they blend supervised fine-tuning (SFT) and reinforcement learning at the sample level, and this coarse-grained control over learning signals often leads to insufficient exploration or catastrophic forgetting. To overcome this, the authors propose EGSPO, a three-stage framework that, after an SFT warm-up phase, dynamically modulates gradients at the token level based on prediction entropy: high-entropy tokens undergo full PPO updates to encourage exploration, while low-entropy tokens receive attenuated updates to preserve learned knowledge. Both branches are weighted by the advantage function, so incorrect trajectories receive consistent negative feedback. The approach introduces, for the first time, a token-level entropy-gated mechanism for fine-grained gradient modulation, balancing exploration with variance reduction and mitigating overconfident erroneous behaviors. Experiments show gains of 3.8% on AIME and 2.9% on MATH, with only a 3.4% increase in computational overhead.

Technology Category

Application Category

📝 Abstract
Hybrid training methods for large language models combine supervised fine-tuning (SFT) on expert demonstrations with reinforcement learning (RL) on model rollouts, typically at the sample level. We propose Entropy-Gated Selective Policy Optimization (EGSPO), a three-stage framework that extends sample-level mixing with token-level gradient modulation. Stage 1, SFT expert learning, establishes a reliable warm-up policy using expert demonstrations with a pure SFT loss. Stage 2, RL rollout generation, samples trajectories from the current policy and computes per-token predictive entropy. Stage 3, the EGSPO mechanism, applies entropy-gated gradient allocation: a predictive-entropy module routes high-entropy tokens to full PPO updates to encourage exploration, and low-entropy tokens to attenuated PPO updates to reduce variance and preserve knowledge. Critically, both branches incorporate the advantage function A_t, ensuring that incorrect trajectories receive consistent negative learning signals and preventing reinforcement of confident errors. EGSPO achieves consistent improvements on mathematical reasoning benchmarks, with gains of 3.8% on AIME and 2.9% on MATH over the CHORD-phi baseline, while incurring only 3.4% additional computational overhead.
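The entropy-gated allocation in Stage 3 can be sketched as a per-token weighting rule. The sketch below is an illustrative reading of the abstract, not the authors' implementation: the entropy threshold, the attenuation factor, and the function names are all assumptions. It shows the two key properties described above: high-entropy tokens receive full weight, low-entropy tokens receive an attenuated weight, and both are scaled by the advantage A_t so that a bad trajectory still produces a negative signal even for confident tokens.

```python
import math

def token_entropy(probs):
    """Predictive entropy of one next-token probability distribution."""
    return -sum(p * math.log(p + 1e-12) for p in probs)

def egspo_token_weights(dists, advantages,
                        entropy_threshold=1.0, attenuation=0.2):
    """Entropy-gated per-token gradient weights (illustrative sketch).

    High-entropy tokens get full weight 1.0 (full PPO update, exploration);
    low-entropy tokens get an attenuated weight (variance reduction,
    knowledge preservation). Both branches are multiplied by the
    advantage A_t, so incorrect trajectories (negative advantage) still
    receive a consistent negative learning signal.
    `entropy_threshold` and `attenuation` are hypothetical hyperparameters,
    not values reported in the paper.
    """
    weights = []
    for probs, a_t in zip(dists, advantages):
        h = token_entropy(probs)
        gate = 1.0 if h >= entropy_threshold else attenuation
        weights.append(gate * a_t)
    return weights

# A uniform distribution (high entropy) keeps its full advantage,
# while a peaked distribution (low entropy) is attenuated.
dists = [[0.25, 0.25, 0.25, 0.25],   # entropy ~= 1.386
         [0.97, 0.01, 0.01, 0.01]]   # entropy ~= 0.168
advantages = [1.0, -1.0]
print(egspo_token_weights(dists, advantages))  # [1.0, -0.2]
```

In actual training these weights would multiply each token's PPO policy-gradient term; the point of the gate is that it rescales magnitude only, never the sign supplied by the advantage.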
Problem

Research questions and friction points this paper is trying to address.

hybrid training
large language models
token-level gradient allocation
reinforcement learning
supervised fine-tuning
Innovation

Methods, ideas, or system contributions that make the work stand out.

token-level gradient allocation
entropy-gated policy optimization
hybrid training
predictive entropy
PPO
🔎 Similar Papers
No similar papers found.