AI Summary
This work studies how to find the first correct output at inference time while minimizing total cost, in pipelines that pair a cheap reward signal with an expensive verifier, such as exact answer checking in mathematical reasoning or hidden-test execution in code generation. The authors formalize the problem as generative active search, a cost-sensitive first-positive search problem. When the score distribution and the score-conditioned success function are known, they characterize the optimal policy by dynamic programming. For the realistic case where both are unknown, they propose ADAP, a shellwise adaptive generate-rank-verify algorithm that progressively increases the number of sampled responses and the number of top-ranked candidates sent to the verifier. Under the monotonicity assumption that higher reward scores are no less likely to pass verification, ADAP's expected cost is within a constant factor of the distribution-aware optimum. Lower bounds based on a centered star number show that some structural assumption on the score-label relationship is necessary. Experiments on mathematical reasoning and competitive programming support the predicted advantage over both fixed non-adaptive policies and difficulty-adaptive baselines.
Abstract
Many inference-time language-model pipelines combine a cheap reward signal with an expensive verifier, such as exact answer checking in mathematical reasoning or hidden-test execution in code generation.
We formalize this setting through a learning-theoretic lens as generative active search: a cost-sensitive first-positive search problem in which a policy adaptively samples candidates from an unknown distribution, observes cheap scores, and pays for verifier labels until it finds a positive example. For a fixed prompt, the generator and reward model induce two unknown objects: a distribution over reward scores and a score-conditioned success function. When these quantities are known, we characterize the distribution-aware optimal policy using a dynamic programming approach. In the realistic setting where both the score distribution and the success function are unknown, we propose ADAP, a shellwise adaptive generate-rank-verify algorithm that progressively increases the number of sampled responses and top-ranked verifications. Under the monotonicity assumption that higher reward scores are no less likely to pass verification, we show that ADAP achieves expected cost within a constant factor of the distribution-aware optimum. We complement this result with learning-theoretic lower bounds, based on a centered star number, showing that structural assumptions on the score-label relationship are necessary. Experiments on mathematical reasoning and competitive programming validate the predicted advantage over both fixed non-adaptive policies and difficulty-adaptive baselines.
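To make the generate-rank-verify loop concrete, here is a minimal Python sketch of a shellwise search. It is not the paper's ADAP implementation: the abstract does not give the schedule, so the growth rule (double the sample pool, add one to the verification budget per shell), the cost values, and the names `shellwise_generate_rank_verify`, `sample`, and `verify` are all illustrative assumptions.

```python
import random
from typing import Callable, List, Optional, Set, Tuple


def shellwise_generate_rank_verify(
    sample: Callable[[], Tuple[str, float]],  # returns (candidate, cheap reward score)
    verify: Callable[[str], bool],            # expensive verifier
    sample_cost: float = 1.0,                 # assumed cost per generated candidate
    verify_cost: float = 10.0,                # assumed cost per verifier call
    max_shells: int = 10,
) -> Tuple[Optional[str], float]:
    """Return (first verified-positive candidate or None, total cost paid)."""
    candidates: List[Tuple[float, int, str]] = []  # (score, id, text)
    verified: Set[int] = set()
    cost = 0.0
    n, k = 1, 1  # pool size and top-k verification budget for the first shell
    for _ in range(max_shells):
        # Generate: grow the pool to the current shell's size.
        while len(candidates) < n:
            text, score = sample()
            candidates.append((score, len(candidates), text))
            cost += sample_cost
        # Rank: order the whole pool by cheap score, highest first.
        ranked = sorted(candidates, key=lambda c: c[0], reverse=True)
        # Verify: pay only for top-k candidates that were not checked before.
        for _score, idx, text in ranked[:k]:
            if idx in verified:
                continue
            verified.add(idx)
            cost += verify_cost
            if verify(text):
                return text, cost
        # Assumed schedule, not taken from the paper.
        n, k = 2 * n, k + 1
    return None, cost


if __name__ == "__main__":
    # Toy monotone setting: a candidate passes with probability equal to its score.
    rng = random.Random(0)

    def toy_sample() -> Tuple[str, float]:
        score = rng.random()
        label = "pass" if rng.random() < score else "fail"
        return label, score

    print(shellwise_generate_rank_verify(toy_sample, lambda t: t == "pass"))
```

The sketch stops at the first positive label, which matches the first-positive objective, and it never pays twice to verify the same candidate. Under the monotonicity assumption, spending the verifier budget on the highest-scoring unverified candidates is the natural ordering; how fast each shell should grow is where the paper's analysis applies, and this sketch does not reproduce it.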