🤖 AI Summary
Existing critic-free reinforcement learning methods such as GRPO rely on uniform credit assignment and cannot identify the critical steps within a reasoning trajectory, which limits sample and token efficiency. This work proposes Selective Eligibility Traces (S-trace). It first introduces P-trace, a critic-free eligibility traces method motivated by partial trust region preservation, and then adds a sparse masking strategy that selectively masks low-entropy tokens to reduce variance and assign credit at a finer granularity. The work also shows that GSPO is a special case of the critic-free eligibility traces framework under uniform credit assignment. S-trace outperforms GRPO across Qwen3 model variants: average pass@16 improves by 0.49%, 3.16%, and 2.98% on the 1.7B, 4B, and 8B models, respectively, with higher sample and token efficiency.
📝 Abstract
Reinforcement Learning with Verifiable Rewards (RLVR) has become a key approach for improving the reasoning abilities of large language models. However, widely used critic-free algorithms such as Group Relative Policy Optimization (GRPO) rely on a "uniform credit assignment" assumption that indiscriminately broadcasts trajectory-level advantages to every token, which hinders learning efficiency because critical reasoning steps are not distinguished from the rest. To address this limitation, we propose Selective Eligibility Traces (S-trace). Grounded in the intuition of partial trust region preservation, we first introduce P-trace, a sample-efficient, critic-free eligibility traces method. We then build S-trace on top of it, implementing a sparse eligibility traces mechanism that further reduces variance and achieves fine-grained credit assignment by selectively masking low-entropy tokens. Theoretically, we place the recent Group Sequence Policy Optimization (GSPO) method within the critic-free eligibility traces framework and identify it as a special instance of the eligibility traces method operating under uniform credit assignment. Experiments show that S-trace outperforms GRPO in average pass@16, with gains of 0.49% on Qwen3-1.7B, 3.16% on Qwen3-4B, and 2.98% when scaled further to Qwen3-8B, while also achieving higher sample and token efficiency.
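To make the contrast concrete, the following toy sketch shows uniform credit assignment next to an entropy-based sparse mask. It is an illustration only and not the S-trace algorithm: the abstract does not give the trace update, the threshold rule, or the exact advantage normalization, so the function names, the group-standardized advantage, and the fixed `threshold` below are all assumptions.

```python
import math
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Standardize rewards within one sampling group (a common GRPO-style form; assumed here)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def broadcast_uniform(advantage, num_tokens):
    """Uniform credit assignment: every token receives the same trajectory-level advantage."""
    return [advantage] * num_tokens


def mask_low_entropy(token_advantages, entropies, threshold):
    """Hypothetical sparse mask: zero the credit of tokens whose entropy is below `threshold`."""
    return [a if h >= threshold else 0.0 for a, h in zip(token_advantages, entropies)]


# One group of four sampled responses with verifiable 0/1 rewards.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)

# Made-up next-token distributions for a three-token response:
# a near-deterministic token, a maximally uncertain one, and a two-way choice.
dists = [
    [0.97, 0.01, 0.01, 0.01],
    [0.25, 0.25, 0.25, 0.25],
    [0.50, 0.50, 0.00, 0.00],
]
entropies = [token_entropy(d) for d in dists]

uniform = broadcast_uniform(advantages[0], len(dists))
sparse = mask_low_entropy(uniform, entropies, threshold=0.5)  # threshold is an arbitrary choice
```

In this toy setup the near-deterministic first token loses its credit under the mask, while the two higher-entropy tokens keep the trajectory-level advantage. Under uniform broadcasting all three tokens are weighted identically.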