🤖 AI Summary
This work proposes Maximum Likelihood Reinforcement Learning (MaxRL), a framework that unifies maximum likelihood estimation with reinforcement learning to address a limitation of standard RL in sampling-based tasks with binary feedback: such training optimizes only a low-order approximation of the likelihood of correct trajectories. MaxRL introduces a family of sampling objectives that smoothly interpolate between standard reinforcement learning and exact maximum likelihood estimation as sampling compute increases, together with an unbiased policy-gradient estimator. Empirically, MaxRL Pareto-dominates existing methods on all evaluated models and tasks, yielding up to a 20-fold improvement in test-time scaling efficiency over GRPO, and scales better with additional data and compute.
📝 Abstract
Reinforcement learning is the method of choice for training models in sampling-based setups with binary outcome feedback, such as navigation, code generation, and mathematical problem solving. In such settings, models implicitly induce a likelihood over correct rollouts. However, we observe that reinforcement learning does not maximize this likelihood and instead optimizes only a lower-order approximation. Inspired by this observation, we introduce Maximum Likelihood Reinforcement Learning (MaxRL), a sampling-based framework that approximates maximum likelihood using reinforcement learning techniques. MaxRL addresses the non-differentiability of sampling by defining a compute-indexed family of sample-based objectives that interpolate between standard reinforcement learning and exact maximum likelihood as additional sampling compute is allocated. The resulting objectives admit a simple, unbiased policy-gradient estimator and converge to maximum likelihood optimization in the infinite-compute limit. Empirically, MaxRL Pareto-dominates existing methods across all models and tasks we tested, achieving up to 20x test-time scaling efficiency gains over its GRPO-trained counterpart. We also observe that MaxRL scales better with additional data and compute. These results suggest that MaxRL is a promising framework for scaling RL training in correctness-based settings.
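To make the distinction between maximizing expected reward and maximizing likelihood concrete, here is a minimal illustrative sketch, not the paper's estimator (which is unbiased; the naive plug-in below is not). For a prompt on which the policy succeeds with probability p, standard RL maximizes p (gradient proportional to ∇p), whereas maximum likelihood maximizes log p (gradient proportional to ∇p / p), so correct rollouts on hard prompts are upweighted by roughly 1/p. The function name and the plug-in estimate p̂ = mean(rewards) are assumptions made for illustration.

```python
def per_sample_weights(rewards, maximize_log_likelihood, eps=1e-8):
    """Gradient weight attached to each rollout, given k binary rewards
    for one prompt.

    Standard RL (grad of p) weights every correct rollout equally by 1/k;
    the log-likelihood view (grad of log p) additionally divides by the
    estimated success probability, upweighting hard prompts.
    Illustrative only: the plug-in 1/p_hat is a biased estimate of 1/p.
    """
    k = len(rewards)
    p_hat = sum(rewards) / k  # estimated success probability for this prompt
    if not maximize_log_likelihood:
        return [r / k for r in rewards]                       # ~ grad of p
    return [r / (k * max(p_hat, eps)) for r in rewards]       # ~ grad of log p


# Easy prompt (3/4 rollouts correct) vs. hard prompt (1/4 correct):
easy = per_sample_weights([1, 1, 1, 0], maximize_log_likelihood=True)
hard = per_sample_weights([1, 0, 0, 0], maximize_log_likelihood=True)
# The lone correct rollout on the hard prompt carries weight 1.0, three
# times the 1/3 weight of each correct rollout on the easy prompt.
```

This 1/p factor is what vanishes in the lower-order approximation the abstract refers to: near p = 1 the two gradients coincide, but on hard prompts maximum likelihood pushes much harder on the rare correct rollouts.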