🤖 AI Summary
This work addresses entropy collapse and training instability in chain-of-thought reasoning caused by sparse token-level rewards. It establishes, for the first time, an explicit connection between group-level rewards and token-level optimization by introducing a token reward aggregation mechanism based on sequence-level likelihood. The authors also design a dynamic KL divergence mask that applies regularization only to tokens with positive advantages and decreasing entropy, enabling advantage-aware entropy regularization. The approach substantially improves policy update stability, achieves state-of-the-art performance on mathematical reasoning benchmarks, and reduces convergence time by 50% compared with GRPO and DAPO.
📝 Abstract
Group Relative Policy Optimization (GRPO) has significantly advanced the reasoning ability of large language models (LLMs), particularly their mathematical reasoning performance. However, GRPO and related entropy regularization methods still struggle with sparse token-level rewards, an inherent challenge in chain-of-thought (CoT) reasoning. These approaches often rely on undifferentiated token-level entropy regularization, which easily leads to entropy collapse or model degradation under sparse token rewards. In this work, we propose TEPO, a novel token-level framework that (1) leverages sequence-level likelihood to link group-level rewards with individual tokens via token-level aggregation, and (2) introduces a token-level KL divergence mask constraint that targets tokens with positive advantages and decreasing entropy to mitigate abrupt policy updates. Experiments demonstrate that TEPO not only achieves state-of-the-art performance on mathematical reasoning benchmarks but also markedly enhances training stability, reducing convergence time by 50% compared with GRPO/DAPO.
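The masking rule in point (2) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the use of the old versus new policy distributions to detect "decreasing entropy", the direction of the KL term, and the averaging over selected tokens are all assumptions made for illustration, since the abstract does not specify them.

```python
import math

def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def token_kl(p_old, p_new):
    """KL(p_old || p_new) at one token position (assumes p_new > 0 wherever p_old > 0)."""
    return sum(po * math.log(po / pn) for po, pn in zip(p_old, p_new) if po > 0.0)

def masked_kl_penalty(old_dists, new_dists, advantages):
    """Hypothetical helper: KL penalty restricted to tokens that have a
    positive advantage AND whose entropy decreased under the new policy."""
    mask, kls = [], []
    for p_old, p_new, adv in zip(old_dists, new_dists, advantages):
        selected = adv > 0 and token_entropy(p_new) < token_entropy(p_old)
        mask.append(selected)
        kls.append(token_kl(p_old, p_new) if selected else 0.0)
    n_selected = sum(mask)
    # Assumption: average over selected tokens only; zero if none qualify.
    penalty = sum(kls) / n_selected if n_selected else 0.0
    return mask, penalty

# Toy example: three token positions over a two-word vocabulary.
old_dists = [[0.5, 0.5], [0.5, 0.5], [0.9, 0.1]]
new_dists = [[0.9, 0.1], [0.9, 0.1], [0.5, 0.5]]
advantages = [1.0, -1.0, 1.0]

mask, penalty = masked_kl_penalty(old_dists, new_dists, advantages)
print(mask)     # only the first token is regularized
print(penalty)
```

In this toy case only the first token is selected: the second has a negative advantage, and the third has a positive advantage but its entropy increases, so neither contributes to the penalty.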