AI Summary
Existing reinforcement learning algorithms (e.g., GRPO, DAPO) employ sequence-level advantage estimation, assigning identical advantage values to all tokens within a response. This fails to capture token-specific contributions to the final outcome and limits mathematical reasoning performance. This work proposes model-free Key Token Advantage Estimation (KTAE), the first token-level advantage decomposition method that requires no auxiliary discriminative model. KTAE quantifies the marginal contribution of each key token to reasoning-path success by leveraging rollout correctness statistics and rule-based rewards. Integrated naturally into the GRPO/DAPO frameworks, KTAE achieves significant improvements across five mathematical reasoning benchmarks: it generates shorter, more accurate responses and surpasses R1-Distill-Qwen-1.5B while starting from the same Qwen-1.5B base model.
Abstract
Recent advances have demonstrated that integrating reinforcement learning with rule-based rewards can significantly enhance the reasoning capabilities of large language models, even without supervised fine-tuning. However, prevalent reinforcement learning algorithms such as GRPO and its variants (e.g., DAPO) suffer from a coarse granularity issue when computing the advantage. Specifically, they compute rollout-level advantages that assign identical values to every token within a sequence, failing to capture token-specific contributions and hindering effective learning. To address this limitation, we propose Key-token Advantage Estimation (KTAE), a novel algorithm that estimates fine-grained, token-level advantages without introducing additional models. KTAE leverages the correctness of sampled rollouts and applies statistical analysis to quantify the importance of individual tokens within a sequence to the final outcome. This quantified token-level importance is then combined with the rollout-level advantage to obtain a more fine-grained token-level advantage estimate. Empirical results show that models trained with GRPO+KTAE and DAPO+KTAE outperform baseline methods across five mathematical reasoning benchmarks. Notably, they achieve higher accuracy with shorter responses and even surpass R1-Distill-Qwen-1.5B using the same base model.
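To make the idea concrete, the combination of a GRPO-style rollout-level advantage with a correctness-derived token importance could be sketched as below. This is a minimal illustration under stated assumptions: the paper's exact statistical test is not given here, so a simple correct-vs-incorrect token frequency contrast stands in for it, and the `1 + importance` scaling rule is a hypothetical choice for how the two signals are combined.

```python
from collections import Counter

def ktae_sketch(rollouts, rewards):
    """Illustrative sketch: scale GRPO-style rollout-level advantages by a
    per-token importance weight derived from rollout correctness statistics.
    The statistic and combination rule here are stand-ins, not the paper's."""
    # GRPO-style rollout-level advantage: reward normalized within the group.
    mean_r = sum(rewards) / len(rewards)
    var_r = sum((r - mean_r) ** 2 for r in rewards) / len(rewards)
    std_r = var_r ** 0.5 or 1.0  # avoid division by zero for identical rewards
    rollout_adv = [(r - mean_r) / std_r for r in rewards]

    # Count, per unique token, how many correct vs. incorrect rollouts contain it.
    correct, incorrect = Counter(), Counter()
    n_correct = sum(1 for r in rewards if r > 0) or 1
    n_incorrect = (len(rewards) - sum(1 for r in rewards if r > 0)) or 1
    for tokens, r in zip(rollouts, rewards):
        (correct if r > 0 else incorrect).update(set(tokens))

    def importance(tok):
        # Marginal association of a token with success:
        # P(tok appears | correct) - P(tok appears | incorrect).
        return correct[tok] / n_correct - incorrect[tok] / n_incorrect

    # Token-level advantage = rollout-level advantage scaled by token importance.
    return [[a * (1.0 + importance(t)) for t in tokens]
            for tokens, a in zip(rollouts, rollout_adv)]
```

With two rollouts sharing a prefix token but diverging afterward, the token unique to the correct rollout receives a larger (amplified) advantage than the shared token, while tokens unique to incorrect rollouts are attenuated.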