🤖 AI Summary
Existing reinforcement learning methods optimize only the single-sample pass rate (pass@1), leading to insufficient sampling diversity, suboptimal collective utility, and poor performance on complex tasks. This paper proposes Pass@k Policy Optimization (PKPO), the first method enabling robust joint optimization of pass@k for arbitrary k ≤ n. Key contributions include: (1) a low-variance, unbiased gradient estimator for pass@k; (2) a differentiable, numerically stable reward transformation function; and (3) in-training k-annealing to jointly improve pass@k while preserving pass@1 robustness. Built upon a policy gradient framework, PKPO unifies handling of binary and continuous rewards. Toy experiments confirm significantly reduced gradient variance. Empirical evaluation on Gemma-2 demonstrates that increasing k substantially improves success rates on hard problems and achieves sustained performance gains on challenging benchmark suites where conventional methods plateau.
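The summary's first contribution, a low-variance unbiased estimator for pass@k, builds on the standard combinatorial estimator from the code-generation literature. As a minimal sketch (this is the well-known estimator of the pass@k *metric*, not necessarily the paper's exact gradient estimator or reward transformation):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased estimate of pass@k from n sampled attempts, c of which
    succeeded: 1 - C(n-c, k) / C(n, k), i.e. one minus the probability
    that a random size-k subset of the n samples contains no success."""
    if n - c < k:
        # Fewer than k failures exist, so every k-subset has a success.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```

Averaging over all size-k subsets, rather than drawing a single subset, is what keeps the variance of this estimator low.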
📝 Abstract
Reinforcement Learning (RL) algorithms sample multiple (n > 1) solution attempts for each problem and reward them independently. This optimizes for pass@1 performance and prioritizes the strength of isolated samples at the expense of the diversity and collective utility of sets of samples, under-utilizing the sampling capacity and limiting exploration and eventual improvement on harder examples. As a fix, we propose Pass-at-k Policy Optimization (PKPO), a transformation on the final rewards that leads to direct optimization of pass@k performance, thus optimizing for sets of samples that maximize reward when considered jointly. Our contribution is to derive novel low-variance, unbiased estimators for pass@k and its gradient, in both the binary and continuous reward settings. We show that optimization with our estimators reduces to standard RL with rewards that have been jointly transformed by a stable and efficient transformation function. While previous efforts are restricted to k = n, ours is the first to enable robust optimization of pass@k for any arbitrary k ≤ n. Moreover, instead of trading off pass@1 performance for pass@k gains, our method allows annealing k during training, optimizing both metrics and often achieving strong pass@1 numbers alongside significant pass@k gains. We validate our reward transformations on toy experiments, which reveal the variance-reducing properties of our formulations. We also include real-world examples using the open-source LLM Gemma-2. We find that our transformation effectively optimizes for the target k. Furthermore, higher k values enable solving more and harder problems, while annealing k boosts both pass@1 and pass@k. Crucially, for challenging task sets where conventional pass@1 optimization stalls, our pass@k approach unblocks learning, likely due to better exploration from prioritizing joint utility over the utility of individual samples.
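The abstract describes annealing k during training to capture both pass@k and pass@1 gains. One plausible way to picture this (the schedule shape, endpoints, and direction here are illustrative assumptions, not the paper's specification) is a linear decay from a large k, which emphasizes joint set utility early for exploration, down to k = 1, which recovers standard pass@1 optimization late in training:

```python
def k_schedule(step: int, total_steps: int, k_start: int, k_end: int = 1) -> int:
    """Hypothetical linear anneal of k from k_start down to k_end.
    Early steps use a large k (set-level reward, more exploration);
    later steps approach k_end = 1 (per-sample reward, pass@1)."""
    frac = min(max(step / total_steps, 0.0), 1.0)  # training progress in [0, 1]
    k = round(k_start + frac * (k_end - k_start))
    return max(k_end, min(k_start, k))  # clamp to the valid range
```

At each training step the current k would then parameterize the pass@k reward transformation applied to that step's n samples.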