Value-Gradient Hypothesis of RL for LLMs

Date: 2026-05-20

AI Summary
This work investigates why critic-free reinforcement learning methods, such as PPO and GRPO, work as well as they do in the post-training of large language models, and when they should give the largest gains. Under a differentiable rollout with an additive-noise parameterization, it shows that the actor update is value-gradient-like in expectation: the backward pass propagates costates whose conditional expectation equals the value gradient. For discrete transformer policies, automatic differentiation through attention produces empirical costates that approximate this value signal, with approximation error controlled by the sampling gap and the policy entropy. The paper then decomposes RL impact into a value-gradient signal and reachable reward headroom, which yields a criterion for when RL should be most effective along a pretraining trajectory.
Abstract
Reinforcement learning substantially improves pretrained language models, but it remains understudied why critic-free methods such as PPO and GRPO work as well as they do, and when they should provide the largest gains. We develop a value-gradient perspective on critic-free RL for LLM post-training. First, under a differentiable rollout and additive-noise parameterization, we show that the actor update is value-gradient-like in expectation: the backward pass propagates costates whose conditional expectation equals the value gradient. Second, for discrete transformer policies, we show that autodifferentiation through attention produces empirical costates that approximate this value signal, with an error controlled by the sampling gap and policy entropy. These results motivate a decomposition of RL impact into value-gradient signal and reachable reward headroom, yielding a criterion for when RL should be most effective along a pretraining trajectory.
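The costate claim in the abstract can be illustrated on a toy problem. The sketch below is not the paper's construction: it assumes a scalar linear system, a linear policy with additive Gaussian noise, and a quadratic reward. The constants A, B, SIGMA and all function names are hypothetical, and the backward pass is written by hand instead of calling an autodiff library. It checks that the costate at the initial state, averaged over noise draws, agrees with a finite-difference estimate of the value gradient computed from the same noise.

```python
import random

# Assumed toy constants (not from the paper).
A, B, SIGMA = 0.9, 0.5, 0.3


def rollout(s0, theta, noise):
    """Differentiable rollout: a_t = theta * s_t + SIGMA * eps_t, s_{t+1} = A * s_t + B * a_t."""
    states = [s0]
    for eps in noise:
        s = states[-1]
        action = theta * s + SIGMA * eps  # additive-noise parameterization
        states.append(A * s + B * action)
    return states


def total_reward(states):
    """Sum of the assumed reward r(s) = -s^2 over every visited state."""
    return sum(-s * s for s in states)


def costates(states, theta):
    """Backward pass along one sampled path.

    lambda_t = r'(s_t) + (d s_{t+1} / d s_t) * lambda_{t+1}, with lambda_{T+1} = 0,
    so lambda_t is the derivative of the remaining reward with respect to s_t.
    """
    closed_loop = A + B * theta  # d s_{t+1} / d s_t for this linear system
    lam = 0.0
    out = []
    for s in reversed(states):
        lam = -2.0 * s + closed_loop * lam
        out.append(lam)
    return out[::-1]


def value(s0, theta, noises):
    """Monte Carlo estimate of V(s0) over a fixed set of noise sequences."""
    return sum(total_reward(rollout(s0, theta, n)) for n in noises) / len(noises)


def value_gradient_fd(s0, theta, noises, h=1e-3):
    """Central finite difference of the Monte Carlo value, reusing the same noise."""
    return (value(s0 + h, theta, noises) - value(s0 - h, theta, noises)) / (2 * h)


def mean_initial_costate(s0, theta, noises):
    """Average of lambda_0 over noise draws: the empirical E[lambda_0 | s_0]."""
    return sum(costates(rollout(s0, theta, n), theta)[0] for n in noises) / len(noises)


if __name__ == "__main__":
    random.seed(0)
    noises = [[random.gauss(0.0, 1.0) for _ in range(5)] for _ in range(200)]
    print("mean costate at s0:", mean_initial_costate(1.0, -0.4, noises))
    print("finite-diff dV/ds0:", value_gradient_fd(1.0, -0.4, noises))
```

In this toy setting the two numbers agree up to floating-point error, because the return is quadratic in the initial state for any fixed noise sequence. The discrete-token case treated in the abstract's second result has no such exact pathwise derivative, which is where the approximation error enters.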
Problem

Research questions and friction points this paper is trying to address.

reinforcement learning
critic-free methods
value-gradient
LLM post-training
policy optimization
Innovation

Methods, ideas, or system contributions that make the work stand out.

value-gradient
critic-free RL
differentiable rollout
autodifferentiation through attention
reward headroom