ROC-n-reroll: How verifier imperfection affects test-time scaling

📅 2025-07-16
🤖 AI Summary
This work investigates how imperfect verifiers affect test-time scaling in language models. It analyzes the local and global geometry of the verifier's ROC curve to characterize Best-of-N and rejection sampling. Scaling of rejection sampling depends on the local geometry of the ROC curve, so its performance cannot be extrapolated from the low-compute regime when the curve is unknown, whereas Best-of-N depends on global properties of the curve. Rejection sampling outperforms Best-of-N at fixed compute, yet both converge to the same accuracy in the infinite-compute limit, set by the slope of the ROC curve near the origin. Experiments on GSM8K, with different versions of Llama and Qwen generating and verifying solutions, confirm the theoretical predictions. The results give an interpretable theoretical basis for improving verifiers and designing compute-efficient inference strategies.

📝 Abstract
Test-time scaling aims to improve language model performance by leveraging additional compute during inference. While many works have empirically studied techniques like Best-of-N (BoN) and rejection sampling that make use of a verifier to enable test-time scaling, there is little theoretical understanding of how verifier imperfection affects performance. In this work, we address this gap. Specifically, we prove that the instance-level accuracy of these methods is precisely characterized by the geometry of the verifier's ROC curve. Interestingly, while scaling is determined by the local geometry of the ROC curve for rejection sampling, it depends on global properties of the ROC curve for BoN. As a consequence, when the ROC curve is unknown, it is impossible to extrapolate the performance of rejection sampling based on the low-compute regime. Furthermore, while rejection sampling outperforms BoN for fixed compute, in the infinite-compute limit both methods converge to the same level of accuracy, determined by the slope of the ROC curve near the origin. Our theoretical results are confirmed by experiments on GSM8K using different versions of Llama and Qwen to generate and verify solutions.
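
The role of the ROC slope can be made concrete with a short calculation. The sketch below is an illustration, not the paper's derivation. It assumes rejection sampling draws candidates until the verifier score clears a fixed threshold, so the verifier acts at a single ROC operating point (tpr, fpr). The function names and the parameter p (the per-sample probability that the generator is correct on a given instance) are hypothetical. Under that assumption, Bayes' rule gives the accuracy of the accepted answer, and pushing the threshold toward the origin of the ROC curve replaces the ratio tpr / fpr with the slope there.

```python
def accepted_accuracy(p: float, tpr: float, fpr: float) -> float:
    """Probability that an accepted sample is correct (Bayes' rule).

    p   : assumed per-sample probability of a correct answer
    tpr : verifier true-positive rate at the chosen threshold
    fpr : verifier false-positive rate at the chosen threshold
    """
    accept = p * tpr + (1.0 - p) * fpr  # probability a single draw is accepted
    if accept <= 0.0:
        raise ValueError("no sample is ever accepted at this operating point")
    return p * tpr / accept


def expected_draws(p: float, tpr: float, fpr: float) -> float:
    """Expected number of draws until the first acceptance (geometric mean)."""
    accept = p * tpr + (1.0 - p) * fpr
    if accept <= 0.0:
        raise ValueError("no sample is ever accepted at this operating point")
    return 1.0 / accept


def limit_accuracy(p: float, slope: float) -> float:
    """Accuracy as the threshold tightens and tpr / fpr approaches `slope`,
    the slope of the ROC curve at the origin (assumed finite here)."""
    return p * slope / (p * slope + (1.0 - p))
```

For example, with p = 0.5 and an operating point (tpr, fpr) = (0.8, 0.2), the accepted answer is correct with probability 0.8 at an expected cost of 2 draws. A verifier no better than chance has slope 1, and the limit falls back to p.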
Problem

Research questions and friction points this paper is trying to address.

How verifier imperfection impacts test-time scaling performance
Relationship between ROC curve geometry and instance-level accuracy
Comparing rejection sampling and Best-of-N at fixed compute and in the infinite-compute limit
Innovation

Methods, ideas, or system contributions that make the work stand out.

Analyzes verifier ROC curve geometry impact
Compares rejection sampling and Best-of-N scaling
Links ROC slope to accuracy limits