🤖 AI Summary
This work addresses the challenge of obtaining low-variance policy gradient estimates at low computational cost when applying reinforcement learning to large language models in resource-constrained settings. The authors introduce, for the first time, nonparametric statistical techniques (specifically kernel smoothing) to reinforcement learning with large language models, using them to estimate the value function and hence the advantage. The approach targets low-variance policy gradient optimization using only a small number of sampled reasoning trajectories per prompt. Theoretical analysis and empirical experiments indicate that the proposed method improves policy optimization while avoiding the computational and memory overhead of a learned value network and of large per-prompt sample groups, striking a trade-off between sample efficiency and computational cost.
📝 Abstract
Recent advances in large language models (LLMs) have increasingly relied on reinforcement learning (RL) to improve their reasoning capabilities. Three approaches have been widely adopted: (i) Proximal policy optimization (PPO) and advantage actor-critic (A2C) rely on a deep neural network to estimate the value function of the learning policy in order to reduce the variance of the policy gradient. However, training and maintaining such a value network incurs substantial computational and memory overhead. (ii) Group relative policy optimization (GRPO) avoids training a value network by approximating the value function with sample averages. However, GRPO must sample a large number of reasoning traces per prompt to approximate the value function accurately, which makes it computationally expensive. (iii) REINFORCE-type algorithms sample only a single reasoning trajectory per prompt, which reduces computational cost but leads to poor sample efficiency.
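As an illustration of the sample-average baseline described in (ii), the following minimal Python sketch computes group-relative advantages for one prompt. The function name `group_relative_advantages`, the `eps` constant, and the example rewards are assumptions made for illustration; the per-group standard-deviation scaling follows the commonly described GRPO formulation and is not spelled out in this abstract.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Advantages for the traces sampled from ONE prompt (hypothetical helper).

    The group mean serves as the sample-average value estimate; dividing by
    the group standard deviation is an assumed normalization step.
    """
    baseline = mean(rewards)        # sample-average approximation of the value
    scale = pstdev(rewards) + eps   # eps avoids division by zero
    return [(r - baseline) / scale for r in rewards]


# Four traces for one prompt with binary correctness rewards (made-up data).
adv = group_relative_advantages([1.0, 0.0, 0.0, 1.0])

# With a single trace the baseline equals the reward, so the advantage is zero.
single = group_relative_advantages([1.0])
```

The single-trace case shows why this baseline needs several traces per prompt: with one sample the estimated advantage carries no learning signal.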
In this work, we focus on a practical, resource-constrained setting in which only a small number of reasoning traces can be sampled per prompt, while low-variance gradient estimation remains essential for high-quality policy learning. To address this challenge, we bring classical nonparametric statistical methods, which are both computationally and statistically efficient, to LLM reasoning. We employ kernel smoothing as a concrete example for value function estimation and the subsequent policy optimization. Numerical and theoretical results demonstrate that our proposal achieves accurate value and gradient estimation, leading to improved policy optimization.
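To make the idea of kernel smoothing for value estimation concrete, here is a minimal Python sketch of a Nadaraya-Watson estimator that borrows rewards from similar prompts. It is not the authors' implementation: the abstract does not specify the kernel, the prompt representation, or the bandwidth, so the Gaussian kernel, the toy 2-D prompt features, the `bandwidth` parameter, and the function names below are all assumptions.

```python
import math


def gaussian_kernel(u):
    """Unnormalized Gaussian kernel (assumed choice of kernel)."""
    return math.exp(-0.5 * u * u)


def kernel_value_estimate(query, features, rewards, bandwidth=1.0):
    """Nadaraya-Watson estimate of the expected reward at `query`.

    `features` are hypothetical prompt representations; each observed reward
    is weighted by how close its prompt lies to the query prompt.
    """
    weights = [gaussian_kernel(math.dist(query, x) / bandwidth) for x in features]
    total = sum(weights)
    if total == 0.0:
        # All weights underflowed: fall back to the global mean reward.
        return sum(rewards) / len(rewards)
    return sum(w * r for w, r in zip(weights, rewards)) / total


# Toy data: two prompts near the query with reward 1, one distant prompt with reward 0.
features = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0)]
rewards = [1.0, 1.0, 0.0]

v = kernel_value_estimate((0.05, 0.0), features, rewards, bandwidth=0.5)
advantage = 1.0 - v  # advantage of a new trace with reward 1 at the query prompt
```

Because the baseline pools information across nearby prompts, it can be formed even when each prompt has only one or a few sampled traces. In practice the bandwidth controls the bias-variance trade-off of the estimate and would need to be tuned.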