HTPO: Towards Exploration-Exploitation Balanced Policy Optimization via Hierarchical Token-level Objective Control

📅 2026-05-08

🤖 AI Summary
Existing reinforcement learning approaches for large language models apply the same optimization objective to every token in a response, which makes it hard to balance exploration and exploitation dynamically. To address this, the authors propose HTPO, an algorithm built on a hierarchical token-level objective control mechanism. HTPO partitions response tokens into groups according to prompt difficulty, answer correctness, and token entropy, and assigns each group its own optimization objective so that the exploration-exploitation trade-off can be tuned at token granularity. Trained within a verifiable-reward (RLVR) framework, HTPO outperforms the DAPO baseline by 8.6% on AIME'24 and 6.7% on AIME'25, and its advantage grows as the test-time sampling budget increases.
📝 Abstract
Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as a pivotal technique for enhancing the reasoning capabilities of Large Language Models (LLMs). However, the de facto practice of mainstream RL algorithms is to treat all tokens of one response equally and assign the same optimization objective to each token, failing to provide granular guidance for the reasoning process, even though different tokens usually play distinct roles in Chain-of-Thought (CoT) reasoning. As a result, current RL algorithms lack an effective mechanism to dynamically balance the exploration-exploitation trade-off during learning. To this end, we propose Hierarchical Token-level Objective Control Policy Optimization (HTPO), a novel RL algorithm that uses a divide-and-conquer approach to hierarchically partition the response tokens into specific functional groups along three aspects (i.e., prompt difficulty, answer correctness, and token entropy). Within each group, we design specialized optimization objectives according to the tokens' contributions to exploration or exploitation, so that each token can effectively carry out its expected function. In this way, HTPO achieves a more balanced exploration-exploitation trade-off. Extensive experiments on challenging reasoning benchmarks validate the superiority of HTPO, which significantly outperforms the strong DAPO baseline (e.g., +8.6% and +6.7% on AIME'24 and AIME'25, respectively). When scaling test-time compute, the HTPO-trained model maintains a consistent performance advantage over the DAPO baseline, and the gap widens as the sampling budget increases, validating that our adaptive token-level control method fosters effective exploration without sacrificing exploitation performance. Code will be available at https://github.com/xcyao00/HTPO.
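To make the three-way partition described in the abstract more concrete, here is a minimal Python sketch of how tokens might be labeled by prompt difficulty, answer correctness, and token entropy. It is an illustration under stated assumptions, not the authors' implementation: the names `token_entropy`, `TokenGroup`, and `partition_tokens`, both threshold values, and the use of per-prompt sampling accuracy as a difficulty proxy are all hypothetical. The group-specific optimization objectives are not specified in this summary and are deliberately left out.

```python
import math
from dataclasses import dataclass
from typing import List, Sequence


def token_entropy(probs: Sequence[float]) -> float:
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


@dataclass(frozen=True)
class TokenGroup:
    """Hypothetical label for one token along the three partition axes."""
    hard_prompt: bool   # assumption: prompt counts as hard if few samples are correct
    correct: bool       # whether the full response received a positive verifiable reward
    high_entropy: bool  # whether the policy was uncertain at this position


def partition_tokens(
    token_probs: Sequence[Sequence[float]],
    group_accuracy: float,
    is_correct: bool,
    difficulty_threshold: float = 0.5,  # assumed value, not from the paper
    entropy_threshold: float = 1.0,     # assumed value, not from the paper
) -> List[TokenGroup]:
    """Assign every token of one response to a (difficulty, correctness, entropy) group.

    token_probs holds one next-token distribution per response position.
    group_accuracy is the fraction of sampled responses to the same prompt
    that were correct, used here as an assumed proxy for prompt difficulty.
    """
    hard = group_accuracy < difficulty_threshold
    return [
        TokenGroup(hard, is_correct, token_entropy(p) > entropy_threshold)
        for p in token_probs
    ]
```

With this sketch, a call such as `partition_tokens([[0.25, 0.25, 0.25, 0.25], [0.97, 0.01, 0.01, 0.01]], group_accuracy=0.25, is_correct=True)` labels the first position as high-entropy and the second as low-entropy, both within a correct response to a hard prompt. A training loop could then look up a separate objective for each label.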
Problem

Research questions and friction points this paper is trying to address.

Reinforcement Learning
Large Language Models
Exploration-Exploitation Trade-off
Token-level Optimization
Chain-of-Thought Reasoning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Hierarchical Token-level Control
Exploration-Exploitation Balance
Reinforcement Learning with Verifiable Rewards
Chain-of-Thought Reasoning
Policy Optimization