Token-Level Policy Optimization: Linking Group-Level Rewards to Token-Level Aggregation via Sequence-Level Likelihood

📅 2026-04-14

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses entropy collapse and training instability in chain-of-thought reasoning caused by sparse token-level rewards. It establishes an explicit connection between group-level rewards and token-level optimization by introducing a token-level reward aggregation mechanism based on sequence-level likelihood. The authors also design a dynamic KL-divergence mask that applies regularization only to tokens with positive advantages and decreasing entropy, which makes the entropy regularization advantage-aware. The approach improves policy update stability, achieves state-of-the-art performance on mathematical reasoning benchmarks, and reduces convergence time by 50% compared with GRPO and DAPO.
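The aggregation idea can be illustrated with a minimal Python sketch. The summary does not state the exact formula, so everything here is an assumption: the group-normalized advantage follows the usual GRPO convention, the sequence-level likelihood ratio is taken as the length-normalized (geometric-mean) ratio of token probabilities, and the function names `group_advantages`, `sequence_ratio`, and `token_weighted_advantages` are hypothetical.

```python
import math
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-8):
    """Normalize one reward per sampled response within its group (GRPO-style)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def sequence_ratio(new_logps, old_logps):
    """Assumed sequence-level likelihood ratio: geometric mean of token ratios.

    new_logps / old_logps are per-token log-probabilities of the same response
    under the current and the sampling policy.
    """
    diffs = [n - o for n, o in zip(new_logps, old_logps)]
    return math.exp(sum(diffs) / len(diffs))


def token_weighted_advantages(new_logps, old_logps, advantage):
    """Spread one group-level advantage over every token of a response.

    Each token receives the same sequence-level ratio times the response's
    advantage, so a sparse sequence reward still reaches all tokens.
    """
    ratio = sequence_ratio(new_logps, old_logps)
    return [ratio * advantage for _ in new_logps]


if __name__ == "__main__":
    rewards = [1.0, 0.0, 1.0, 0.0]          # one scalar reward per response
    advs = group_advantages(rewards)
    new_lp = [-1.0, -2.0]                    # toy per-token log-probs
    old_lp = [-1.5, -2.5]
    print(token_weighted_advantages(new_lp, old_lp, advs[0]))
```

Under this reading, the group-level reward enters only through the scalar advantage, while the sequence-level ratio decides how strongly every token of that response is updated.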

📝 Abstract

Group Relative Policy Optimization (GRPO) has significantly advanced the reasoning ability of large language models (LLMs), particularly in their mathematical reasoning performance. However, GRPO and related entropy regularization methods still struggle with sparse token-level rewards, which is an inherent challenge in chain-of-thought (CoT) reasoning. These approaches often rely on undifferentiated token-level entropy regularization, which easily leads to entropy collapse or model degradation under sparse token rewards. In this work, we propose TEPO, a novel token-level framework that (1) leverages sequence-level likelihood to link group-level rewards with individual tokens via token-level aggregation, and (2) introduces a token-level KL-Divergence mask constraint that targets tokens with positive advantages and decreasing entropy to mitigate abrupt policy updates. Experiments demonstrate that TEPO not only achieves state-of-the-art performance on mathematical reasoning benchmarks but also markedly enhances training stability, reducing convergence time by 50% compared with GRPO/DAPO.
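The KL-Divergence mask in point (2) can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the paper's implementation: "decreasing entropy" is read as the current policy's token entropy being lower than the previous policy's at the same position, the per-token KL term uses the non-negative estimator exp(d) - d - 1 common in GRPO-style code, and the names `kl_mask`, `masked_kl_penalty`, and the coefficient `beta` are hypothetical.

```python
import math


def kl_mask(advantages, entropy_new, entropy_old):
    """Select tokens with positive advantage AND decreasing entropy.

    Returns 1.0 for tokens that should receive the KL penalty, else 0.0.
    """
    return [
        1.0 if (a > 0 and h_new < h_old) else 0.0
        for a, h_new, h_old in zip(advantages, entropy_new, entropy_old)
    ]


def masked_kl_penalty(new_logps, ref_logps, mask, beta=0.01):
    """Average a per-token KL estimate over the masked tokens only.

    Uses the estimator exp(d) - d - 1 with d = ref_logp - new_logp,
    which is zero when both policies agree and positive otherwise.
    """
    total = 0.0
    for lp_new, lp_ref, m in zip(new_logps, ref_logps, mask):
        d = lp_ref - lp_new
        total += m * (math.exp(d) - d - 1.0)
    count = sum(mask)
    if count == 0:
        return 0.0  # no token selected, so no regularization this step
    return beta * total / count


if __name__ == "__main__":
    advantages = [1.0, 1.0, -1.0]     # per-token advantages (toy values)
    entropy_new = [0.5, 0.9, 0.2]     # token entropies under current policy
    entropy_old = [0.8, 0.7, 0.6]     # token entropies under previous policy
    mask = kl_mask(advantages, entropy_new, entropy_old)
    print(mask)
    print(masked_kl_penalty([-1.0, -2.0, -3.0], [-0.5, -2.0, -3.0], mask))
```

In the toy input, only the first token is penalized: the second has a positive advantage but rising entropy, and the third has a negative advantage. Restricting the penalty this way leaves exploratory tokens unregularized while damping abrupt updates on tokens that are being reinforced and are already becoming more deterministic.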

Problem

Research questions and friction points this paper is trying to address.

token-level sparse rewards

entropy collapse

chain-of-thought reasoning

policy optimization

large language models

Innovation

Methods, ideas, or system contributions that make the work stand out.

Token-Level Policy Optimization

Sequence-Level Likelihood

KL-Divergence Mask