Addressing Performance Saturation for LLM RL via Precise Entropy Curve Control

📅 2026-04-29

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the performance saturation that large language models exhibit during reinforcement learning, which the authors characterize by entropy collapse. They propose Entrocraft, a rejection-sampling method that realizes any user-specified entropy schedule by biasing the advantage distribution. Entrocraft requires no objective regularization and works with arbitrary advantage estimators. The authors also relate per-step entropy change to the advantage distribution under minimal assumptions, which explains the behavior of existing RL and entropy-preserving methods. Experimentally, Entrocraft delays performance saturation: a 4B-parameter model outperforms an 8B baseline, improvement is sustained for up to 4x longer before plateauing, and pass@K rises by 50% over the baseline, with better generalization and output diversity. In a systematic study of entropy schedules, linear annealing, which starts high and decays to a slightly lower target, performs best.

📝 Abstract

Reinforcement learning (RL) has unlocked complex reasoning abilities in large language models (LLMs). However, most RL algorithms suffer from performance saturation, preventing further gains as RL training scales. This problem can be characterized by the collapse of entropy, a key diagnostic for exploration in RL. Existing attempts have tried to prevent entropy collapse through regularization or clipping, but their resulting entropy curves often exhibit instability in the long term, which hinders performance gains. In this paper, we introduce Entrocraft, a simple rejection-sampling approach that realizes any user-customized entropy schedule by biasing the advantage distributions. Entrocraft requires no objective regularization and is advantage-estimator-agnostic. Theoretically, we relate per-step entropy change to the advantage distribution under minimal assumptions, which explains the behavior of existing RL and entropy-preserving methods. Entrocraft also enables a systematic study of entropy schedules, where we find that linear annealing, which starts high and decays to a slightly lower target, performs best. Empirically, Entrocraft addresses performance saturation, significantly improving generalization, output diversity, and long-term training. It enables a 4B model to outperform an 8B baseline, sustains improvement for up to 4x longer before plateauing, and raises pass@K by 50% over the baseline.
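
The abstract does not spell out the algorithm, so the following is only a rough sketch of the two ideas it names: a linear entropy schedule and rejection sampling that biases the advantage distribution. The function names, the `keep_prob` parameter, and the rejection rule itself (dropping positive-advantage samples when entropy is below target and negative-advantage samples when it is above) are assumptions made for illustration, not the paper's actual procedure.

```python
import random


def linear_entropy_target(step, total_steps, start, end):
    """Linearly anneal the target entropy from `start` to `end`.

    Hypothetical helper: the abstract only says the schedule starts high
    and decays to a slightly lower target.
    """
    frac = min(max(step / total_steps, 0.0), 1.0)  # clamp progress to [0, 1]
    return start + (end - start) * frac


def filter_by_advantage(samples, entropy, target, keep_prob=0.5, rng=random):
    """Rejection-sample (sample_id, advantage) pairs to steer entropy.

    ASSUMED heuristic (not taken from the paper): positive-advantage samples
    tend to lower entropy and negative-advantage samples tend to raise it,
    so samples that would push entropy further from the target are kept
    only with probability `keep_prob`.
    """
    kept = []
    for sample_id, adv in samples:
        pushes_away = (entropy < target and adv > 0) or (
            entropy > target and adv < 0
        )
        if pushes_away and rng.random() >= keep_prob:
            continue  # reject this sample
        kept.append((sample_id, adv))
    return kept


if __name__ == "__main__":
    target = linear_entropy_target(step=50, total_steps=100, start=1.0, end=0.5)
    batch = [("a", 1.0), ("b", -1.0), ("c", 0.0)]
    # Measured entropy (0.3) is below the target, so positive-advantage
    # samples are the ones subject to rejection under the assumed rule.
    print(target, filter_by_advantage(batch, entropy=0.3, target=target))
```

Because the filter only decides which samples reach the policy update, a sketch like this leaves the loss and the advantage estimator untouched, which is in the spirit of the abstract's claim that the method needs no objective regularization and is advantage-estimator-agnostic.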

Problem

Research questions and friction points this paper is trying to address.

performance saturation

entropy collapse

reinforcement learning

large language models

exploration

Innovation

Methods, ideas, or system contributions that make the work stand out.

entropy control

rejection sampling

reinforcement learning