🤖 AI Summary
This work addresses the diminishing marginal returns observed when repeatedly sampling from large language models under a fixed computational budget, which cause problem coverage to grow slowly. To improve coverage efficiency, the authors propose Reset-and-Discard (ReD), an adaptive reset-and-discard query mechanism. They establish, for the first time, a quantitative link between pass@k and coverage@cost, showing that the empirically observed power law in pass@k implies sublinear coverage growth, and they leverage this insight to design a general query strategy that requires no prior knowledge of pass@k and automatically infers the power-law exponent. Evaluated on the HumanEval benchmark, ReD significantly reduces the number of attempts, token consumption, and overall cost required to achieve a target coverage across three mainstream large language models, while also offering a novel and efficient approach to measuring reasoning-related power laws.
📝 Abstract
The performance of large language models (LLMs) on verifiable tasks is usually measured by pass@k, the probability of answering a question correctly at least once in k trials. At a fixed budget, a more suitable metric is coverage@cost, the average number of unique questions answered as a function of the total number of attempts. We connect the two metrics and show that the empirically observed power-law behavior of pass@k leads to sublinear growth of coverage@cost (diminishing returns). To address this, we propose Reset-and-Discard (ReD), a method for querying LLMs that increases coverage@cost at any given budget, regardless of the form of pass@k. Moreover, given a pass@k curve, we can quantitatively predict the savings in the total number of attempts under ReD. If pass@k is not available for the model, ReD can infer its power-law exponent. Experiments on three LLMs using HumanEval demonstrate that ReD substantially reduces the attempts, tokens, and USD cost required to reach a desired coverage, while also offering an efficient way to measure inference power laws.
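The abstract's core claim, that a power law in pass@k forces sublinear growth of coverage@cost, can be illustrated with a toy calculation. This is a hypothetical sketch, not the paper's code: the exponent `alpha`, the pool size, the specific power-law form `1 - pass@k = k^(-alpha)`, and the round-robin baseline are all illustrative assumptions.

```python
# Toy illustration (assumed, not from the paper): suppose the failure rate
# decays as a power law, 1 - pass@k = k**(-alpha), and a naive baseline
# spreads the attempt budget evenly across a fixed pool of questions.
alpha = 0.5          # assumed power-law exponent
n_questions = 100    # assumed question-pool size

def coverage_at_cost(total_attempts):
    """Expected number of unique questions solved after spending
    `total_attempts` attempts, round-robin, k per question."""
    k = total_attempts / n_questions
    return n_questions * (1 - k ** (-alpha))

# Doubling the budget buys far less than double the coverage gain:
print(round(coverage_at_cost(1000), 1))  # k = 10 per question -> 68.4
print(round(coverage_at_cost(2000), 1))  # k = 20 per question -> 77.6
```

Under these assumptions the second thousand attempts add only about 9 newly solved questions versus about 13 for the jump from 500 to 1000 attempts, which is the diminishing-returns regime that ReD is designed to escape.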