Entropy Polarity in Reinforcement Fine-Tuning: Direction, Asymmetry, and Control

📅 2026-05-12

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing entropy-aware reinforcement fine-tuning methods regulate policy entropy mainly through global objectives and overlook the direction and structural asymmetry of token-level entropy changes, which makes it hard to balance exploration and exploitation at the token level. This work introduces “entropy polarity,” a signed token-level quantity derived from a first-order approximation of the entropy change that predicts whether a sampled policy update expands or contracts entropy. Building on this, the authors propose Polarity-Aware Policy Optimization (PAPO), which preserves both the positive and negative polarity branches and controls entropy through advantage reweighting, using the empirical entropy trajectory as an online phase signal. Experiments on mathematical reasoning and agentic benchmarks show that PAPO consistently outperforms competitive baselines, with better training efficiency and higher reward.

📝 Abstract

Policy entropy has emerged as a fundamental measure for understanding and controlling exploration in reinforcement learning with verifiable rewards (RLVR) for LLMs. However, existing entropy-aware methods mainly regulate entropy through global objectives, while the token-level mechanism by which sampled policy updates reshape policy entropy remains underexplored. In this work, we develop a theoretical framework of entropy mechanics in RLVR. Our analysis yields a first-order approximation of the entropy change, giving rise to entropy polarity, a signed token-level quantity that predicts how much a sampled update expands or contracts entropy. This analysis further reveals a structural asymmetry: reinforcing frequent high-probability tokens triggers contraction tendencies, whereas expansive tendencies typically require lower-probability samples or stronger distributional correction. Empirically, we show that entropy polarity reliably predicts entropy changes, and that positive and negative polarity branches play complementary roles in preserving exploration while strengthening exploitation. Building on these insights, we propose Polarity-Aware Policy Optimization (PAPO), which preserves both polarity branches and implements entropy control through advantage reweighting. With the empirical entropy trajectory as an online phase signal, PAPO adaptively reallocates optimization pressure between entropy-expanding and entropy-contracting updates. Experiments on mathematical reasoning and agentic benchmarks show that PAPO consistently outperforms competitive baselines, while delivering superior training efficiency and substantial reward improvements.
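The abstract does not give the formula for entropy polarity, so the following is only a minimal sketch of the general idea, not the paper's definition. It assumes a single softmax distribution over a tiny vocabulary and a plain policy-gradient step applied directly to the logits, `delta_z = lr * advantage * (onehot(token) - probs)`. Under those assumptions, the first-order entropy change is the dot product of that step with the entropy gradient `dH/dz_j = -p_j * (log p_j + H)`. All function names and the toy distribution are hypothetical.

```python
import math

def softmax(logits):
    # Numerically stable softmax over a list of logits.
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    # Shannon entropy in nats.
    return -sum(p * math.log(p) for p in probs)

def sampled_logit_step(probs, token, advantage, lr):
    # Assumed update: gradient of advantage * log pi(token) w.r.t. the logits.
    return [lr * advantage * ((1.0 if j == token else 0.0) - p)
            for j, p in enumerate(probs)]

def first_order_entropy_change(logits, token, advantage, lr):
    # Signed first-order estimate of the entropy change for one sampled update.
    probs = softmax(logits)
    h = entropy(probs)
    grad_h = [-p * (math.log(p) + h) for p in probs]  # dH/dz_j for a softmax
    step = sampled_logit_step(probs, token, advantage, lr)
    return sum(g * d for g, d in zip(grad_h, step))

def actual_entropy_change(logits, token, advantage, lr):
    # Exact entropy change after applying the same step, for comparison.
    probs = softmax(logits)
    step = sampled_logit_step(probs, token, advantage, lr)
    new_probs = softmax([z + d for z, d in zip(logits, step)])
    return entropy(new_probs) - entropy(probs)

if __name__ == "__main__":
    logits = [math.log(0.7), math.log(0.2), math.log(0.1)]
    for token in range(len(logits)):
        est = first_order_entropy_change(logits, token, advantage=1.0, lr=1e-3)
        act = actual_entropy_change(logits, token, advantage=1.0, lr=1e-3)
        print(token, est, act)
```

In this toy setting, reinforcing the 0.7-probability token with a positive advantage gives a negative estimate (contraction), while reinforcing the 0.1-probability token gives a positive one (expansion), which mirrors the asymmetry the abstract describes. Flipping the sign of the advantage flips the sign of the estimate.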

Problem

Research questions and friction points this paper is trying to address.

policy entropy

reinforcement learning

token-level mechanism

entropy polarity

RLVR

Innovation

Methods, ideas, or system contributions that make the work stand out.

entropy polarity

reinforcement fine-tuning

token-level entropy dynamics