🤖 AI Summary
Current reinforcement learning methods for LLMs apply a uniform optimization strategy across all tokens, ignoring their heterogeneous roles in reasoning. To address this, we propose HAPO, the first heterogeneous adaptive policy optimization framework tailored to token-level characteristics. Its core innovation is the use of token entropy as a dynamic control signal, enabling token-wise adaptive temperature sampling, group-mean advantage normalization, differentiated reward redistribution, and asymmetric gradient clipping. This end-to-end token-aware mechanism significantly enhances the fine-grained control and robustness of policy optimization. Experiments demonstrate that HAPO consistently outperforms DAPO across models of multiple scales, achieving substantial improvements in both reasoning performance and the balance of training dynamics.
📄 Abstract
Reinforcement learning has emerged as a fundamental technique for enhancing reasoning in LLMs. However, existing algorithms apply uniform optimization to all tokens, ignoring their different roles in the reasoning process. To address this limitation, we introduce Heterogeneous Adaptive Policy Optimization (HAPO), a comprehensive token-aware algorithm that dynamically adapts optimization based on token entropy. For rollout sampling, we propose Adaptive Temperature Sampling, which adjusts the sampling temperature in real time, promoting exploration at high-entropy tokens while preserving coherence at low-entropy ones. For advantage calculation, we introduce Token Level Group Average, which normalizes advantages at the token level, jointly accounting for sequence length, as in a token-mean loss, while preserving unbiased treatment. We then develop Differential Advantage Redistribution, which leverages entropy and importance ratios to modulate rewards, adjusting updates for tokens with clear signals. For the clipping loss, we design Asymmetric Adaptive Clipping, which allows aggressive probability reduction for noisy low-entropy tokens while enabling exploration at high-entropy tokens. Through a systematic investigation of the relationship between entropy and training dynamics, we embed token-level treatment into every stage to achieve fine-grained control. Extensive experiments demonstrate that HAPO consistently outperforms DAPO across multiple model scales. Our code can be found at https://github.com/starriver030515/HAPO.
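To make the entropy-as-control-signal idea concrete, the sketch below shows one plausible form of Adaptive Temperature Sampling: the next-token distribution's entropy is normalized and mapped linearly into a temperature range, so flat (high-entropy) distributions are sampled hotter and peaked (low-entropy) ones cooler. The linear mapping and the `t_min`/`t_max` bounds are illustrative assumptions, not HAPO's exact schedule.

```python
import numpy as np

def softmax(logits, temperature=1.0):
    # Numerically stable softmax over next-token logits
    z = logits / temperature
    z = z - z.max()
    p = np.exp(z)
    return p / p.sum()

def token_entropy(logits):
    # Shannon entropy of the temperature-1 next-token distribution
    p = softmax(logits)
    return float(-(p * np.log(p + 1e-12)).sum())

def adaptive_temperature(logits, t_min=0.8, t_max=1.2):
    # Map normalized entropy into [t_min, t_max]: high-entropy tokens
    # are sampled hotter (more exploration), low-entropy tokens cooler
    # (preserving coherence). Linear mapping is an assumption here.
    h = token_entropy(logits)
    h_max = np.log(len(logits))  # maximum entropy over this vocabulary
    return t_min + (t_max - t_min) * (h / h_max)

# A peaked distribution (confident token) gets a low temperature;
# a flat distribution (uncertain token) gets a high one.
peaked = np.array([8.0, 0.1, 0.1, 0.1])
flat = np.array([1.0, 1.1, 0.9, 1.0])
t_peaked = adaptive_temperature(peaked)
t_flat = adaptive_temperature(flat)
```

A rollout loop would then call `adaptive_temperature` per decoding step and sample from `softmax(logits, temperature=t)`, instead of using one fixed temperature for the whole sequence.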