🤖 AI Summary
Existing reinforcement learning methods optimize only the single-sample pass rate (pass@1), leading to insufficient sampling diversity, suboptimal collective utility, and poor performance on complex tasks. This paper proposes Pass@k Policy Optimization (PKPO), the first method enabling robust joint optimization of pass@k for arbitrary k ≤ n. Key contributions include: (1) a low-variance, unbiased gradient estimator for pass@k; (2) a differentiable, numerically stable reward transformation function; and (3) in-training k-annealing to jointly improve pass@k while preserving pass@1 robustness. Built upon a policy gradient framework, PKPO unifies handling of binary and continuous rewards. Toy experiments confirm significantly reduced gradient variance. Empirical evaluation on Gemma-2 demonstrates that increasing k substantially improves success rates on hard problems and achieves sustained performance gains on challenging benchmark suites where conventional methods plateau.
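The summary's first contribution, a low-variance unbiased estimator for pass@k, builds on the standard combinatorial estimator from the code-generation literature. As a minimal sketch (this is the well-known estimator of the pass@k *metric*, not necessarily the paper's exact gradient estimator or reward transformation):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased estimate of pass@k from n sampled attempts, c of which
    succeeded: 1 - C(n-c, k) / C(n, k), i.e. one minus the probability
    that a random size-k subset of the n samples contains no success."""
    if n - c < k:
        # Fewer than k failures exist, so every k-subset has a success.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```

Averaging over all size-k subsets, rather than drawing a single subset, is what keeps the variance of this estimator low.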
📝 Abstract
Reinforcement Learning (RL) algorithms sample multiple (n > 1) solution attempts for each problem and reward them independently. This optimizes for pass@1 performance and prioritizes the strength of isolated samples at the expense of the diversity and collective utility of sets of samples, under-utilizing the sampling capacity and limiting exploration and eventual improvement on harder examples. As a fix, we propose Pass-at-k Policy Optimization (PKPO), a transformation on the final rewards that leads to direct optimization of pass@k performance, thus optimizing for sets of samples that maximize reward when considered jointly. Our contribution is to derive novel low-variance, unbiased estimators for pass@k and its gradient, in both the binary and continuous reward settings. We show that optimization with our estimators reduces to standard RL with rewards that have been jointly transformed by a stable and efficient transformation function. While previous efforts are restricted to k = n, ours is the first to enable robust optimization of pass@k for any arbitrary k ≤ n. Moreover, instead of trading off pass@1 performance for pass@k gains, our method allows annealing k during training, optimizing both metrics and often achieving strong pass@1 numbers alongside significant pass@k gains. We validate our reward transformations on toy experiments, which reveal the variance-reducing properties of our formulations. We also include real-world examples using the open-source LLM Gemma-2. We find that our transformation effectively optimizes for the target k. Furthermore, higher k values enable solving more and harder problems, while annealing k boosts both pass@1 and pass@k. Crucially, for challenging task sets where conventional pass@1 optimization stalls, our pass@k approach unblocks learning, likely due to better exploration from prioritizing joint utility over the utility of individual samples.
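The abstract describes annealing k during training to capture both pass@k and pass@1 gains. One plausible way to picture this (the schedule shape, endpoints, and direction here are illustrative assumptions, not the paper's specification) is a linear decay from a large k, which emphasizes joint set utility early for exploration, down to k = 1, which recovers standard pass@1 optimization late in training:

```python
def k_schedule(step: int, total_steps: int, k_start: int, k_end: int = 1) -> int:
    """Hypothetical linear anneal of k from k_start down to k_end.
    Early steps use a large k (set-level reward, more exploration);
    later steps approach k_end = 1 (per-sample reward, pass@1)."""
    frac = min(max(step / total_steps, 0.0), 1.0)  # training progress in [0, 1]
    k = round(k_start + frac * (k_end - k_start))
    return max(k_end, min(k_start, k))  # clamp to the valid range
```

At each training step the current k would then parameterize the pass@k reward transformation applied to that step's n samples.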