🤖 AI Summary
Reinforcement learning (RL) fine-tuning often impairs the exploration capability of large language models (LLMs), reducing generation diversity and degrading Best-of-N sampling performance—particularly in mathematical and programming reasoning tasks.
Method: We propose the max@k optimization framework, which continuously relaxes the discrete pass@k metric and derives unbiased on-policy and off-policy gradient estimators, enabling end-to-end alignment between the training objective and multi-sample inference strategies. The approach builds on reinforcement learning with verifiable rewards, preserving generation diversity while directly optimizing max@k, including under off-policy updates.
Contribution/Results: Experiments show consistent improvements over standard RL fine-tuning across multiple reasoning benchmarks, with clear gains in max@k under off-policy training and in large-N sampling-based problem solving, aligning the trained model with the Best-of-N inference strategy.
📝 Abstract
The application of Reinforcement Learning with Verifiable Rewards (RLVR) to mathematical and coding domains has demonstrated significant improvements in the reasoning and problem-solving abilities of Large Language Models. Despite its success in single-generation problem solving, the reinforcement learning fine-tuning process may harm the model's exploration ability, as reflected in decreased diversity of generations and a resulting degradation of performance during Best-of-N sampling for large N values. In this work, we focus on optimizing the max@k metric, a continuous generalization of pass@k. We derive an unbiased on-policy gradient estimate for direct optimization of this metric. Furthermore, we extend our derivations to off-policy updates, a common element of modern RLVR algorithms that allows better sample efficiency. Empirically, we show that our objective effectively optimizes the max@k metric in off-policy scenarios, aligning the model with the Best-of-N inference strategy.
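To make the metrics concrete, the sketch below computes the standard unbiased pass@k estimator (Chen et al., 2021) and an unbiased estimate of max@k, i.e., the expected maximum reward over k i.i.d. samples, from n ≥ k observed rewards. This is an illustration of the metrics only, not the paper's gradient estimators; the function names are our own.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n samples, c of which are correct:
    1 - C(n-c, k) / C(n, k), the chance a random size-k subset has a correct sample."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

def max_at_k(rewards: list[float], k: int) -> float:
    """Unbiased estimate of E[max of k i.i.d. draws] from n >= k observed rewards.

    Averages the subset maximum over all C(n, k) subsets in closed form:
    after sorting, the i-th smallest reward (1-indexed) is the maximum of
    exactly C(i-1, k-1) subsets of size k.
    """
    n = len(rewards)
    assert n >= k, "need at least k samples"
    r = sorted(rewards)
    total = sum(comb(i - 1, k - 1) * r[i - 1] for i in range(k, n + 1))
    return total / comb(n, k)
```

For binary (0/1) rewards, max@k reduces to pass@k: with rewards `[0, 0, 1, 1]` and k=2, both estimators give 1 - C(2,2)/C(4,2) = 5/6, which is one way the continuous metric generalizes the discrete one.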