🤖 AI Summary
To address token-level reward sparsity in chain-of-thought (CoT) reasoning, which induces entropy collapse and model degradation, this paper proposes TEPO, a token-level policy optimization framework. TEPO introduces a novel sequence-likelihood mechanism that refines group-level rewards down to the token level via Markovian likelihood modeling, enabling differentiable and stable reward allocation. It further incorporates adaptive token-level entropy regularization to prevent global entropy collapse. By combining the GRPO framework with token-granular policy updates, TEPO significantly improves training stability and inference consistency on mathematical reasoning tasks, achieving state-of-the-art results on both @k metrics and accuracy. The core contributions are: (1) a sequence-likelihood-driven mechanism that backpropagates rewards from groups to tokens, and (2) a fine-grained entropy control strategy that ensures robust policy learning.
📝 Abstract
Group Relative Policy Optimization (GRPO) has significantly advanced the reasoning ability of large language models (LLMs), particularly by boosting their mathematical performance. However, GRPO and related entropy-regularization methods still face challenges rooted in the sparse token rewards inherent to chain-of-thought (CoT). Current approaches often rely on undifferentiated token-level entropy adjustments, which frequently lead to entropy collapse or model collapse. In this work, we propose TEPO, a novel token-level framework that incorporates Markov Likelihood (sequence likelihood) to link group-level rewards to individual tokens via token-level aggregation. Experiments show that TEPO consistently outperforms existing baselines across key metrics (including @k and accuracy). It not only sets a new state of the art on mathematical reasoning tasks but also significantly enhances training stability.
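The two mechanisms described above can be illustrated with a minimal sketch. Note this is an assumption-based reconstruction, not the paper's actual implementation: the exact form of TEPO's Markov-likelihood aggregation and its adaptive entropy schedule are not specified here, so the sketch uses GRPO-style group normalization of sequence rewards, a likelihood-proportional split of each sequence's advantage across its tokens, and a fixed entropy bonus coefficient `beta` (all illustrative choices).

```python
import numpy as np

def token_level_rewards(group_rewards, token_logprobs, token_entropies, beta=0.01):
    """Hypothetical sketch of group-to-token reward allocation.

    group_rewards:   one scalar reward per sampled sequence in the group.
    token_logprobs:  list of arrays, per-token log pi(a_t | s_<t) per sequence.
    token_entropies: list of arrays, per-token policy entropy per sequence.
    beta:            entropy-bonus coefficient (illustrative, not from the paper).
    """
    # GRPO-style group normalization: each sequence gets a relative advantage.
    r = np.asarray(group_rewards, dtype=float)
    adv = (r - r.mean()) / (r.std() + 1e-8)

    per_token = []
    for a, lp, h in zip(adv, token_logprobs, token_entropies):
        # Likelihood-based split (assumed): tokens receive a share of the
        # sequence advantage proportional to their (softmaxed) log-probability,
        # so the allocation stays differentiable and sums to the sequence advantage.
        w = np.exp(lp - np.max(lp))
        w = w / w.sum()
        # Token-level entropy bonus to counteract entropy collapse.
        per_token.append(a * w + beta * h)
    return per_token
```

With `beta=0`, the per-token rewards of each sequence sum back to its group-normalized advantage, so the token-level view is consistent with the sequence-level GRPO objective.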