Entropy Polarity in Reinforcement Fine-Tuning: Direction, Asymmetry, and Control

📅 2026-05-12
📈 Citations: 0
✨ Influential: 0

🤖 AI Summary
Existing reinforcement learning fine-tuning methods struggle to precisely balance token-level exploration and exploitation due to their neglect of the directional dynamics and structural asymmetry in policy entropy changes. This work introduces the concept of “entropy polarity,” which captures, via a first-order approximation, whether policy updates expand or contract token-level entropy. Building on this insight, we propose Polarity-Aware Policy Optimization (PAPO), an algorithm that preserves both positive and negative entropy polarity branches and integrates advantage reweighting with online entropy trajectory phase signals to enable dynamic entropy control. Experimental results demonstrate that PAPO significantly outperforms baseline methods on mathematical reasoning and agent-based benchmarks, achieving superior training efficiency and reward performance.
πŸ“ Abstract
Policy entropy has emerged as a fundamental measure for understanding and controlling exploration in reinforcement learning with verifiable rewards (RLVR) for LLMs. However, existing entropy-aware methods mainly regulate entropy through global objectives, while the token-level mechanism by which sampled policy updates reshape policy entropy remains underexplored. In this work, we develop a theoretical framework of entropy mechanics in RLVR. Our analysis yields a first-order approximation of the entropy change, giving rise to entropy polarity, a signed token-level quantity that predicts how much a sampled update expands or contracts entropy. This analysis further reveals a structural asymmetry: reinforcing frequent high-probability tokens triggers contraction tendencies, whereas expansive tendencies typically require lower-probability samples or stronger distributional correction. Empirically, we show that entropy polarity reliably predicts entropy changes, and that positive and negative polarity branches play complementary roles in preserving exploration while strengthening exploitation. Building on these insights, we propose Polarity-Aware Policy Optimization (PAPO), which preserves both polarity branches and implements entropy control through advantage reweighting. With the empirical entropy trajectory as an online phase signal, PAPO adaptively reallocates optimization pressure between entropy-expanding and entropy-contracting updates. Experiments on mathematical reasoning and agentic benchmarks show that PAPO consistently outperforms competitive baselines, while delivering superior training efficiency and substantial reward improvements.
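The sign behavior described in the abstract can be illustrated with a toy calculation. The sketch below is not the paper's definition of entropy polarity; it assumes a single token position with a tabular softmax policy whose logits are moved directly by one policy-gradient step on a sampled token. Under that assumption the entropy gradient with respect to logit j is -p_j (log p_j + H), and the step direction for sampled token k with advantage A is lr * A * (e_k - p). All function names here are hypothetical.

```python
import numpy as np


def softmax(z):
    """Numerically stable softmax over a 1-D logit vector."""
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()


def entropy(p):
    """Shannon entropy (nats) of a strictly positive distribution."""
    return float(-(p * np.log(p)).sum())


def logit_step(p, k, advantage, lr):
    """Policy-gradient step on logits: lr * A * d(log p_k)/dz (toy assumption)."""
    one_hot = np.eye(len(p))[k]
    return lr * advantage * (one_hot - p)


def first_order_entropy_change(p, k, advantage, lr):
    """Linearized entropy change for one step on sampled token k."""
    h = entropy(p)
    grad_h = -p * (np.log(p) + h)  # dH/dz_j for a softmax policy
    return float(grad_h @ logit_step(p, k, advantage, lr))


def actual_entropy_change(z, k, advantage, lr):
    """Exact entropy change after applying the same step to the logits."""
    p = softmax(z)
    return entropy(softmax(z + logit_step(p, k, advantage, lr))) - entropy(p)


if __name__ == "__main__":
    logits = np.array([3.0, 1.0, 0.0, -1.0])
    p = softmax(logits)
    for k in (0, 3):  # most likely and least likely token
        pred = first_order_entropy_change(p, k, advantage=1.0, lr=0.01)
        real = actual_entropy_change(logits, k, advantage=1.0, lr=0.01)
        print(f"token {k}: p={p[k]:.3f} predicted={pred:+.2e} actual={real:+.2e}")
```

In this toy setting, reinforcing the high-probability token yields a negative predicted change (contraction), while reinforcing the low-probability token yields a positive one (expansion), which matches the asymmetry the abstract describes. Flipping the sign of the advantage flips both predictions.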
Problem

Research questions and friction points this paper is trying to address.

policy entropy
reinforcement learning
token-level mechanism
entropy polarity
RLVR
Innovation

Methods, ideas, or system contributions that make the work stand out.

entropy polarity
reinforcement fine-tuning
token-level entropy dynamics
asymmetry in policy updates
Polarity-Aware Policy Optimization
🔎 Similar Papers
2024-10-04 · Conference on Empirical Methods in Natural Language Processing · Citations: 5
Jiazheng Zhang
Fudan University
Large Language Model, Natural Language Processing, Data Mining
Ziche Fu
Fudan NLP Group, Honor Device Co., Ltd
Junrui Shen
Fudan NLP Group, Honor Device Co., Ltd
Yunbin Zhao
Fudan NLP Group, Honor Device Co., Ltd
Yunke Zhang
Fudan NLP Group, Honor Device Co., Ltd
Zhiheng Xi
Fudan University
LLM Reasoning, LLM-based Agents
Long Ma
Dalian University of Technology
Computer Vision, Image Processing
Chenxin An
The University of Hong Kong
Long-context LLMs
Zhihao Zhang
Fudan University
Natural Language Processing
Shichun Liu
Fudan University
NLP
Dingwei Zhu
Fudan NLP Group, Honor Device Co., Ltd
Shihan Dou
Fudan University
LLMs, Code LMs, RL, Alignment
Shaofan Liu
Fudan NLP Group, Honor Device Co., Ltd
Han Li
Fudan NLP Group, Honor Device Co., Ltd
Wiggin Zhou
Fudan NLP Group, Honor Device Co., Ltd
Aiden Adams
Fudan NLP Group, Honor Device Co., Ltd
Tao Gui
Fudan NLP Group, Honor Device Co., Ltd
Fei Huang
Fudan NLP Group, Honor Device Co., Ltd
Qi Zhang
Fudan University
SAGIN, satellite routing
Xuanjing Huang
Fudan NLP Group, Honor Device Co., Ltd