What is the objective of reasoning with reinforcement learning?

📅 2025-10-15
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This paper investigates the mechanism underlying binary-reward reinforcement learning (RL) for reasoning optimization in large language models (LLMs). Method: the authors establish an equivalence between such RL algorithms and stochastic gradient ascent on a monotonically transformed probability of generating the correct answer given the prompt, where rejection sampling corresponds to a logarithmic transformation and GRPO to an arcsine-square-root transformation. Contribution/Results: building on this equivalence, the paper proposes a unified mathematical framework that subsumes diverse reasoning-optimization methods under a single implicit objective, maximizing the probability of a correct answer, and that makes explicit the connections among probabilistic modeling, gradient-based updates, and reward structure. This provides an interpretable, principled foundation for designing and analyzing RL-based reasoning-optimization algorithms.
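A sketch of how the two transforms arise in the binary-reward setting (our reconstruction of the standard argument; the notation below is ours, not quoted from the paper). Let p(x) denote the probability of a correct answer given prompt x under policy π_θ. Rejection sampling takes gradient steps on correct samples only, while GRPO (with binary rewards, in the large-group limit) weights each sample by a standardized advantage:

```latex
% Rejection sampling: log-likelihood gradient on correct samples only.
\mathbb{E}\big[\nabla_\theta \log \pi_\theta(y \mid x) \,\big|\, y \text{ correct}\big]
  = \frac{\nabla_\theta p(x)}{p(x)}
  = \nabla_\theta \log p(x)

% GRPO: standardized binary advantage A = (r - p)/\sqrt{p(1-p)}.
\mathbb{E}\big[A \, \nabla_\theta \log \pi_\theta(y \mid x)\big]
  = \frac{\nabla_\theta p(x)}{\sqrt{p(x)\,(1 - p(x))}}
  = \nabla_\theta \, 2\arcsin\!\sqrt{p(x)}
```

So each method performs stochastic gradient ascent on a monotone transform of p(x): the logarithm for rejection sampling and (up to a constant factor) the arcsine of the square root for GRPO.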

📝 Abstract
We show that several popular algorithms for reinforcement learning in large language models with binary rewards can be viewed as stochastic gradient ascent on a monotone transform of the probability of a correct answer given a prompt. In particular, the transformation associated with rejection sampling algorithms is the logarithm and that associated with the GRPO algorithm is the arcsine of the square root.
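The two transforms named in the abstract can be checked numerically. The sketch below (plain Python, assumptions ours, not code from the paper) verifies that the logarithm has derivative 1/p, the gradient scaling implied by rejection sampling, while 2·arcsin(√p) has derivative 1/√(p(1−p)), the scaling implied by GRPO's standardized binary advantages:

```python
import math

def log_transform(p):
    """Transform associated with rejection sampling: f(p) = log p."""
    return math.log(p)

def arcsine_sqrt_transform(p):
    """Transform associated with GRPO: g(p) = 2 * arcsin(sqrt(p))."""
    return 2.0 * math.asin(math.sqrt(p))

def finite_diff(f, p, h=1e-6):
    """Central finite-difference derivative of f at p."""
    return (f(p + h) - f(p - h)) / (2.0 * h)

for p in (0.1, 0.3, 0.5, 0.7, 0.9):
    # Rejection sampling scales the gradient of p by 1/p ...
    assert abs(finite_diff(log_transform, p) - 1.0 / p) < 1e-4
    # ... while GRPO scales it by 1/sqrt(p(1-p)).
    expected = 1.0 / math.sqrt(p * (1.0 - p))
    assert abs(finite_diff(arcsine_sqrt_transform, p) - expected) < 1e-4
```

The 1/√(p(1−p)) factor means GRPO, relative to rejection sampling, damps updates less aggressively as p grows, since the arcsine-square-root transform stays bounded while the logarithm does not.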
Problem

Research questions and friction points this paper is trying to address.

Analyzes reinforcement learning algorithms for language models
Connects binary reward methods to probability transformations
Compares rejection sampling and GRPO mathematical foundations
Innovation

Methods, ideas, or system contributions that make the work stand out.

Stochastic gradient ascent on transformed probabilities
Logarithm transformation for rejection sampling
Arcsine square root transformation for GRPO