🤖 AI Summary
This paper investigates the fundamental mechanism underlying binary-reward-based reinforcement learning (RL) for reasoning optimization in large language models (LLMs).
Method: We establish a rigorous equivalence between such RL algorithms and stochastic gradient ascent on a monotonically transformed probability of generating the correct answer conditioned on the prompt—where rejection sampling corresponds to a logarithmic transformation and GRPO to an arcsine-square-root transformation.
Contribution/Results: Based on this insight, we propose the first unified mathematical framework that subsumes diverse reasoning optimization methods under a single theoretical paradigm. The framework formally unifies their implicit optimization objectives—namely, maximizing a monotone transform of the probability of answer correctness—and reveals intrinsic connections among probabilistic modeling, gradient-based updates, and reward structure. It provides an interpretable, principled foundation for both the design and theoretical analysis of RL-based reasoning optimization algorithms.
📝 Abstract
We show that several popular algorithms for reinforcement learning in large language models with binary rewards can be viewed as stochastic gradient ascent on a monotone transform of the probability of a correct answer given a prompt. In particular, the transformation associated with rejection sampling algorithms is the logarithm and that associated with the GRPO algorithm is the arcsine of the square root.
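The claimed correspondence can be sanity-checked numerically in a deliberately simplified setting. The sketch below assumes a one-parameter toy "policy" that answers correctly with probability p = sigmoid(theta) and receives binary reward r = 1 iff correct; the variable names, the population-std advantage normalization, and the specific value of p are illustrative assumptions, not details from the paper. Under these assumptions, the expected rejection-sampling update equals the gradient of log p, and the expected GRPO-style update is proportional to the gradient of arcsin(sqrt(p)).

```python
import math

# Toy model (assumed, for illustration): correct answer with probability
# p = sigmoid(theta); score function d/dtheta log pi(a) = a - p for a in {0, 1}.
p = 0.3                       # assumed success probability
dp_dtheta = p * (1.0 - p)     # d sigmoid / d theta at this p

def score(a):
    """Score function of the Bernoulli(p) toy policy."""
    return a - p

# Rejection sampling: keep only correct samples (weight by r, renormalize).
# Expected update direction: E[r * score] / E[r].
rs_grad = (p * 1 * score(1) + (1 - p) * 0 * score(0)) / p

# Gradient of log p w.r.t. theta: (1/p) * dp/dtheta.
log_grad = (1.0 / p) * dp_dtheta

# GRPO-style normalized advantage A = (r - mean) / std with the population
# mean p and std sqrt(p(1-p)) of the Bernoulli reward.
std = math.sqrt(p * (1 - p))
grpo_grad = p * ((1 - p) / std) * score(1) + (1 - p) * ((0 - p) / std) * score(0)

# Gradient of arcsin(sqrt(p)) w.r.t. theta: dp/dtheta / (2 sqrt(p(1-p))).
arcsin_grad = dp_dtheta / (2.0 * std)

print(abs(rs_grad - log_grad))             # rejection sampling matches grad log p
print(abs(grpo_grad - 2.0 * arcsin_grad))  # GRPO update proportional to grad arcsin(sqrt p)
```

In this toy model the GRPO update equals the arcsine-square-root gradient up to a constant factor (2 here), consistent with "stochastic gradient ascent on a monotone transform" in the abstract: constant rescalings do not change the transform being ascended.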