🤖 AI Summary
Reinforcement learning (RL) fine-tuning often impairs the exploration capability of large language models (LLMs), reducing generation diversity and degrading Best-of-N sampling performance—particularly in mathematical and programming reasoning tasks.
Method: We propose the max@k optimization framework, which continuously relaxes the discrete pass@k metric and derives unbiased on-policy and off-policy gradient estimators, enabling end-to-end alignment between the training objective and multi-sample inference strategies. The approach builds on reinforcement learning with verifiable rewards, preserving generation diversity while directly optimizing max@k, including under off-policy updates.
Contribution/Results: Experiments show consistent improvements over standard RL fine-tuning across multiple reasoning benchmarks, with clear gains in max@k under off-policy training and in large-N sampling-based problem solving, aligning the trained model with the Best-of-N inference strategy.
📝 Abstract
The application of Reinforcement Learning with Verifiable Rewards (RLVR) to mathematical and coding domains has demonstrated significant improvements in the reasoning and problem-solving abilities of Large Language Models. Despite its success in single-generation problem solving, the reinforcement learning fine-tuning process may harm the model's exploration ability, as reflected in decreased diversity of generations and a resulting degradation of performance during Best-of-N sampling for large N values. In this work, we focus on optimizing the max@k metric, a continuous generalization of pass@k. We derive an unbiased on-policy gradient estimate for direct optimization of this metric. Furthermore, we extend our derivations to off-policy updates, a common element of modern RLVR algorithms that allows better sample efficiency. Empirically, we show that our objective effectively optimizes the max@k metric in off-policy scenarios, aligning the model with the Best-of-N inference strategy.
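To make the metrics concrete, the sketch below computes the standard unbiased pass@k estimator (Chen et al., 2021) and an unbiased estimate of max@k, i.e., the expected maximum reward over k i.i.d. samples, from n ≥ k observed rewards. This is an illustration of the metrics only, not the paper's gradient estimators; the function names are our own.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n samples, c of which are correct:
    1 - C(n-c, k) / C(n, k), the chance a random size-k subset has a correct sample."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

def max_at_k(rewards: list[float], k: int) -> float:
    """Unbiased estimate of E[max of k i.i.d. draws] from n >= k observed rewards.

    Averages the subset maximum over all C(n, k) subsets in closed form:
    after sorting, the i-th smallest reward (1-indexed) is the maximum of
    exactly C(i-1, k-1) subsets of size k.
    """
    n = len(rewards)
    assert n >= k, "need at least k samples"
    r = sorted(rewards)
    total = sum(comb(i - 1, k - 1) * r[i - 1] for i in range(k, n + 1))
    return total / comb(n, k)
```

For binary (0/1) rewards, max@k reduces to pass@k: with rewards `[0, 0, 1, 1]` and k=2, both estimators give 1 - C(2,2)/C(4,2) = 5/6, which is one way the continuous metric generalizes the discrete one.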