GRPO-$λ$: Credit Assignment improves LLM Reasoning

📅 2025-09-30

📈 Citations: 0

✨ Influential: 0

career value

184K/year

🤖 AI Summary

GRPO lacks explicit reward modeling or a critic network in RL-based fine-tuning of LLMs, hindering token-level fine-grained credit assignment and limiting complex mathematical reasoning capabilities. To address this, we propose GRPO-λ: a novel variant that restructures temporal credit assignment via the λ-return mechanism, integrating critic-free temporal-difference error approximation with token-level log-probability eligibility traces, and incorporating a multi-strategy weighted training framework. Evaluated on models ranging from 1.5B to 7B parameters, GRPO-λ improves training efficiency by 30–40% over standard GRPO. It achieves average score gains exceeding 3 points across multiple mathematical reasoning benchmarks—with the 7B model attaining a maximum improvement of 4.5 points. Notably, GRPO-λ is the first method to enable efficient, scalable, token-level credit assignment without requiring a critic network.

Technology Category

Application Category

📝 Abstract

Large language models (LLMs) are increasingly deployed for tasks requiring complex reasoning, prompting significant interest in improving their reasoning abilities through post-training. Especially RL based methods using verifiable reward, like the state-of-the-art GRPO, have shown to tremendously improve reasoning behaviors when applied as post-training methods. However, the lack of an explicit reward or critic model limits GRPO's ability to assign fine-grained credit across token sequences. In this work, we present GRPO-$λ$, a novel extension to GRPO that enhances credit assignment in RL finetuning of LLMs for complex reasoning tasks. We approximate learning from $λ$-return with a reformulation of eligibility traces using token-level log-probabilities applied after each sequence generation, and a novel critic-free approximation of the temporal-difference error. We introduce a few variations for the weighting of the $λ$-return, and their applications to the eligibility-trace, where all the variations provide significant gains over GRPO. We compare GRPO-$λ$ against GRPO by training models from 1.5B to 7B parameters on $4$ different math reasoning datasets. The training plots demonstrate 30-40% improved performance during RL training on both LLaMA-3.1 and Qwen-2.5 architectures. Finally, we show that with GRPO-$λ$, the resulting average performance on AIME24, Math500, OlympiadMath, MinervaMath, and AMC improves over GRPO by over $3$ points and a $4.5$ points improvement on the 7B model.

Problem

Research questions and friction points this paper is trying to address.

Enhancing credit assignment in RL finetuning of LLMs

Improving reasoning abilities through better token-level rewards

Addressing limitations of existing methods in complex reasoning tasks

Innovation

Methods, ideas, or system contributions that make the work stand out.

Enhanced credit assignment using eligibility traces

Novel critic-free temporal-difference error approximation

Improved RL finetuning with λ-return weighting variations

🔎 Similar Papers

Semantic Self-Consistency: Enhancing Language Model Reasoning via Semantic Weighting