🤖 AI Summary
Existing Pass@$k$ inference strategies, such as majority voting and Best-of-$N$, suffer from performance saturation or even degradation on difficult tasks as the sampling size $N$ and the target rank $k$ grow.
Method: We propose Best-of-Majority (BoM), a strategy that combines the robustness of majority voting with the selection flexibility of Best-of-$N$. BoM first restricts the $N$ sampled responses to those with high frequency (the majority-voting component), then selects the top-$k$ of the remaining candidates by estimated reward (the Best-of-$N$ component).
Contribution/Results: BoM is the first Pass@$k$ method to achieve minimax-optimal regret bounds, and its performance does not degrade as $N$ increases. Empirically, on mathematical reasoning benchmarks, BoM significantly outperforms baseline methods, especially in high-$k$ and large-$N$ regimes.
📝 Abstract
LLM inference often generates a batch of candidates for a prompt and selects one via strategies like majority voting or Best-of-$N$ (BoN). For difficult tasks, this single-shot selection often underperforms. Consequently, evaluations commonly report Pass@$k$: the agent may submit up to $k$ responses, and only the best of them is used when computing regret. Motivated by this, we study inference scaling in the more general Pass@$k$ inference setting, and prove that neither majority voting nor BoN exhibits the desirable scaling with $k$ and the sampling budget $N$. Combining the advantages of majority voting and BoN, we propose a new inference strategy called Best-of-Majority (BoM), with a pivotal step that restricts the candidates to the responses with high frequency in the $N$ samples before selecting the top-$k$ rewards. We prove that when the sampling budget is $N=\tilde{\Omega}(C^*)$, the regret of BoM is $O(\epsilon_{\mathrm{opt}}+\sqrt{\epsilon_{\mathrm{RM}}^2 C^*/k})$, where $C^*$ is the coverage coefficient, $\epsilon_{\mathrm{RM}}$ is the estimation error of the reward model, and $\epsilon_{\mathrm{opt}}$ is the estimation error of reward at the optimal response. We further establish a matching lower bound, certifying that our algorithm is minimax optimal. Beyond optimality, BoM has a key advantage: unlike majority voting and BoN, its performance does not degrade when increasing $N$. Experimental results of inference on math problems show BoM outperforming both majority voting and BoN.
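The pivotal two-stage step described above can be sketched in a few lines of Python. This is a minimal illustration, not the paper's implementation: the frequency threshold `min_count` is a hypothetical hyperparameter standing in for the theoretically derived cutoff, and `reward` stands in for a learned reward model.

```python
from collections import Counter

def best_of_majority(responses, reward, k, min_count=2):
    """Sketch of BoM: given N sampled responses, (1) keep only
    responses appearing at least `min_count` times (majority-style
    frequency filter), then (2) return the top-k survivors by
    estimated reward (Best-of-N-style selection)."""
    counts = Counter(responses)
    candidates = [r for r, c in counts.items() if c >= min_count]
    if not candidates:
        # Fallback when no response clears the threshold:
        # consider all distinct responses.
        candidates = list(counts)
    return sorted(candidates, key=reward, reverse=True)[:k]

# Toy usage: the rare response "c" gets the highest (possibly
# erroneous) reward, but the frequency filter removes it, so BoM
# submits the frequent, well-rewarded "b" instead.
samples = ["a"] * 5 + ["b"] * 3 + ["c"] + ["d"]
scores = {"a": 0.2, "b": 0.5, "c": 0.9, "d": 0.1}
print(best_of_majority(samples, scores.__getitem__, k=1))  # ['b']
```

The fallback branch is a design choice for the sketch: without it, an overly aggressive threshold could leave the candidate set empty, whereas plain BoN corresponds to `min_count=1` and pure majority voting to `k=1` with selection by count rather than reward.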