Token-Level Policy Optimization: Linking Group-Level Rewards to Token-Level Aggregation via Markov Likelihood

📅 2025-10-10
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
To address token-level reward sparsity in chain-of-thought (CoT) reasoning—which induces entropy collapse and model degradation—this paper proposes TEPO, a token-level policy optimization framework. TEPO introduces a novel sequence likelihood mechanism that refines group-level rewards to the token level via Markovian likelihood modeling, enabling differentiable and stable reward allocation. It further incorporates adaptive token-level entropy regularization to prevent global entropy collapse. By integrating the GRPO framework with token-granular policy updates, TEPO significantly improves training stability and inference consistency on mathematical reasoning tasks, achieving state-of-the-art performance on both @k metrics and accuracy. The core contributions are: (1) a sequence-likelihood-driven reward backpropagation mechanism from groups to tokens, and (2) a fine-grained entropy control strategy ensuring robust policy learning.
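The summary above describes two mechanisms. As a minimal illustrative sketch (not the paper's actual code), the snippet below shows one plausible reading of the first: a GRPO-style group-relative advantage is distributed to tokens in proportion to each token's share of the sequence (Markov) log-likelihood. The function name, tensor shapes, and weighting rule are all assumptions.

```python
import torch

def token_level_advantages(logprobs: torch.Tensor,
                           mask: torch.Tensor,
                           group_rewards: torch.Tensor) -> torch.Tensor:
    """Distribute group-level rewards to tokens via sequence likelihood.

    logprobs:      (G, T) per-token log-probs of G sampled completions.
    mask:          (G, T) 1.0 for response tokens, 0.0 for padding.
    group_rewards: (G,)   one scalar reward per completion.
    Returns (G, T) token-level advantages.
    """
    # GRPO-style group-relative advantage: normalize rewards within the group.
    adv = (group_rewards - group_rewards.mean()) / (group_rewards.std() + 1e-8)

    # The Markov (sequence) likelihood factorizes over tokens, so each token's
    # log-prob is its additive contribution to the sequence log-likelihood.
    seq_ll = (logprobs * mask).sum(dim=-1, keepdim=True)        # (G, 1), <= 0
    token_weight = (logprobs * mask) / seq_ll.clamp(max=-1e-8)  # (G, T), >= 0

    # Broadcast each completion's advantage down to its tokens.
    return adv.unsqueeze(-1) * token_weight * mask
```

Under this reading, each token receives a share of its completion's group-relative advantage proportional to its contribution to the sequence log-likelihood, yielding a token-granular signal instead of a single sequence-level scalar.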

📝 Abstract
Group Relative Policy Optimization (GRPO) has significantly advanced the reasoning ability of large language models (LLMs), particularly by boosting their mathematical performance. However, GRPO and related entropy-regularization methods still face challenges rooted in the sparse token rewards inherent to chain-of-thought (CoT). Current approaches often rely on undifferentiated token-level entropy adjustments, which frequently lead to entropy collapse or model collapse. In this work, we propose TEPO, a novel token-level framework that incorporates Markov Likelihood (sequence likelihood) to link group-level rewards with tokens via token-level aggregation. Experiments show that TEPO consistently outperforms existing baselines across key metrics (including @k and accuracy). It not only sets a new state of the art on mathematical reasoning tasks but also significantly enhances training stability.
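The abstract contrasts TEPO with undifferentiated token-level entropy adjustments. As a rough, hypothetical sketch of what a differentiated (adaptive) alternative could look like (the paper's actual regularizer is not reproduced here), the snippet below gates the entropy bonus per token rather than applying it uniformly; target_entropy, beta, and the gating rule are assumptions.

```python
import torch

def adaptive_entropy_bonus(logits: torch.Tensor,
                           mask: torch.Tensor,
                           target_entropy: float = 1.0,
                           beta: float = 0.01) -> torch.Tensor:
    """Per-token (rather than uniform) entropy regularization.

    logits: (G, T, V) policy logits; mask: (G, T) response-token mask.
    Returns a scalar bonus to add to the policy objective.
    """
    logp = torch.log_softmax(logits, dim=-1)
    token_entropy = -(logp.exp() * logp).sum(dim=-1)  # (G, T)

    # Detached gate: only tokens whose entropy has already dropped below
    # the target get pushed back up; high-entropy tokens are left alone.
    gate = (token_entropy < target_entropy).float().detach()
    return beta * (gate * token_entropy * mask).sum() / mask.sum().clamp(min=1.0)
```

Gating the bonus per token is one way to counter the failure mode the abstract attributes to undifferentiated adjustments.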
Problem

Research questions and friction points this paper is trying to address.

Optimizing token-level policies to link group rewards with token aggregation
Addressing entropy collapse in chain-of-thought reasoning methods
Improving mathematical reasoning performance and training stability in LLMs
Innovation

Methods, ideas, or system contributions that make the work stand out.

Token-level policy optimization with Markov likelihood
Links group rewards to token aggregation
Enhances training stability and reasoning accuracy
Xingyu Lin
Baidu Inc; College of Computer Science and Technology, Jilin University; Key Laboratory of Symbolic Computation and Knowledge Engineering of MOE, Jilin University
Yilin Wen
The University of Tokyo
Computer Vision, Robotics
En Wang
Professor, Jilin University
crowdsensing, blockchain, artificial intelligence
Du Su
Assistant Researcher, CAS Key Laboratory of AI Safety
AI safety
Wenbin Liu
College of Computer Science and Technology, Jilin University; Key Laboratory of Symbolic Computation and Knowledge Engineering of MOE, Jilin University
Chenfu Bao
Baidu Inc
Zhonghou Lv
Baidu Inc