🤖 AI Summary
GRPO lacks explicit reward modeling or a critic network in RL-based fine-tuning of LLMs, hindering token-level fine-grained credit assignment and limiting complex mathematical reasoning capabilities. To address this, we propose GRPO-λ: a novel variant that restructures temporal credit assignment via the λ-return mechanism, integrating critic-free temporal-difference error approximation with token-level log-probability eligibility traces, and incorporating a multi-strategy weighted training framework. Evaluated on models ranging from 1.5B to 7B parameters, GRPO-λ improves training efficiency by 30–40% over standard GRPO. It achieves average score gains exceeding 3 points across multiple mathematical reasoning benchmarks—with the 7B model attaining a maximum improvement of 4.5 points. Notably, GRPO-λ is the first method to enable efficient, scalable, token-level credit assignment without requiring a critic network.
📝 Abstract
Large language models (LLMs) are increasingly deployed for tasks requiring complex reasoning, prompting significant interest in improving their reasoning abilities through post-training. Especially RL based methods using verifiable reward, like the state-of-the-art GRPO, have shown to tremendously improve reasoning behaviors when applied as post-training methods. However, the lack of an explicit reward or critic model limits GRPO's ability to assign fine-grained credit across token sequences. In this work, we present GRPO-$λ$, a novel extension to GRPO that enhances credit assignment in RL finetuning of LLMs for complex reasoning tasks. We approximate learning from $λ$-return with a reformulation of eligibility traces using token-level log-probabilities applied after each sequence generation, and a novel critic-free approximation of the temporal-difference error. We introduce a few variations for the weighting of the $λ$-return, and their applications to the eligibility-trace, where all the variations provide significant gains over GRPO. We compare GRPO-$λ$ against GRPO by training models from 1.5B to 7B parameters on $4$ different math reasoning datasets. The training plots demonstrate 30-40% improved performance during RL training on both LLaMA-3.1 and Qwen-2.5 architectures. Finally, we show that with GRPO-$λ$, the resulting average performance on AIME24, Math500, OlympiadMath, MinervaMath, and AMC improves over GRPO by over $3$ points and a $4.5$ points improvement on the 7B model.