🤖 AI Summary
To address token-level reward sparsity in chain-of-thought (CoT) reasoning, which induces entropy collapse and model degradation, this paper proposes TEPO, a token-level policy optimization framework. TEPO introduces a novel sequence-likelihood mechanism that refines group-level rewards down to the token level via Markovian likelihood modeling, enabling differentiable and stable reward allocation. It further incorporates adaptive token-level entropy regularization to prevent global entropy collapse. By combining the GRPO framework with token-granular policy updates, TEPO significantly improves training stability and inference consistency on mathematical reasoning tasks, achieving state-of-the-art results on both @k metrics and accuracy. The core contributions are: (1) a sequence-likelihood-driven mechanism that backpropagates rewards from groups to tokens, and (2) a fine-grained entropy control strategy that ensures robust policy learning.
📝 Abstract
Group Relative Policy Optimization (GRPO) has significantly advanced the reasoning ability of large language models (LLMs), particularly by boosting their mathematical performance. However, GRPO and related entropy-regularization methods still face challenges rooted in the sparse token rewards inherent to chain-of-thought (CoT). Current approaches often rely on undifferentiated token-level entropy adjustments, which frequently lead to entropy collapse or model collapse. In this work, we propose TEPO, a novel token-level framework that incorporates Markov Likelihood (sequence likelihood) to link group-level rewards to individual tokens via token-level aggregation. Experiments show that TEPO consistently outperforms existing baselines across key metrics (including @k and accuracy). It not only sets a new state of the art on mathematical reasoning tasks but also significantly enhances training stability.
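The two mechanisms described above can be illustrated with a minimal sketch. Note this is an assumption-based reconstruction, not the paper's actual implementation: the exact form of TEPO's Markov-likelihood aggregation and its adaptive entropy schedule are not specified here, so the sketch uses GRPO-style group normalization of sequence rewards, a likelihood-proportional split of each sequence's advantage across its tokens, and a fixed entropy bonus coefficient `beta` (all illustrative choices).

```python
import numpy as np

def token_level_rewards(group_rewards, token_logprobs, token_entropies, beta=0.01):
    """Hypothetical sketch of group-to-token reward allocation.

    group_rewards:   one scalar reward per sampled sequence in the group.
    token_logprobs:  list of arrays, per-token log pi(a_t | s_<t) per sequence.
    token_entropies: list of arrays, per-token policy entropy per sequence.
    beta:            entropy-bonus coefficient (illustrative, not from the paper).
    """
    # GRPO-style group normalization: each sequence gets a relative advantage.
    r = np.asarray(group_rewards, dtype=float)
    adv = (r - r.mean()) / (r.std() + 1e-8)

    per_token = []
    for a, lp, h in zip(adv, token_logprobs, token_entropies):
        # Likelihood-based split (assumed): tokens receive a share of the
        # sequence advantage proportional to their (softmaxed) log-probability,
        # so the allocation stays differentiable and sums to the sequence advantage.
        w = np.exp(lp - np.max(lp))
        w = w / w.sum()
        # Token-level entropy bonus to counteract entropy collapse.
        per_token.append(a * w + beta * h)
    return per_token
```

With `beta=0`, the per-token rewards of each sequence sum back to its group-normalized advantage, so the token-level view is consistent with the sequence-level GRPO objective.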