🤖 AI Summary
This work addresses premature, incorrect release in black-box generate-and-verify AI workflows, where evaluator scores are generated adaptively and monitored repeatedly. The authors propose a general-purpose release wrapper that requires neither likelihood models nor exchangeability assumptions. The wrapper builds a hard-negative reference pool from high-scoring failures, calibrates deployment-time scores against that pool to obtain conservative evidence, and accumulates the evidence with an e-process so that decisions remain valid under optional stopping. Given a conservative reference pool, the method provides finite-sample control of the probability of releasing on infeasible tasks, and the authors characterize conditions under which it still achieves nontrivial release on feasible ones. In an MBPP+ coding-agent case study, the wrapper reduces premature incorrect releases relative to baseline stopping rules while still releasing on tasks where the workflow repeatedly accumulates moderate supporting evidence.
📝 Abstract
LLM-enabled AI workflows increasingly produce outputs through iterative generate-evaluate-revise loops. Each iteration can improve the candidate, but it also creates a release decision: when to stop and output the current result? This raises a statistical challenge because deployment-time evaluator scores are adaptively generated and repeatedly monitored, yet the likelihood models or exchangeability assumptions typically used for calibration are unavailable. We propose an always-valid release wrapper for existing generator-evaluator pipelines. The wrapper builds a hard-negative reference pool of high-scoring failures, calibrates deployment-time evaluator scores against this pool, and accumulates the resulting evidence with an e-process. This separates two roles: the reference pool turns black-box scores into conservative evidence, while the e-process provides validity under optional stopping. In theory, we show that a conservative reference pool yields finite-sample control of the probability of releasing on infeasible tasks, that is, tasks for which the given workflow is not capable of producing a reliable solution. We also characterize conditions under which the same conservative rule still achieves nontrivial release on feasible tasks. In an MBPP+ coding-agent case study, the wrapper reduces premature incorrect release relative to baseline stopping rules while still releasing on tasks for which the workflow repeatedly accumulates moderate supporting evidence.
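The three steps named in the abstract (reference pool, calibration, e-process) can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify the calibration rule or the e-value construction, so the sketch swaps in two standard choices, a rank-based p-value against the pool and the power calibrator e = κ·p^(κ−1). All function names and the parameters `alpha` and `kappa` are hypothetical. The guarantee sketched here holds only under the assumption that, on an infeasible task, each p-value is conservative (super-uniform) given the earlier rounds; that is the role the abstract assigns to a conservative reference pool.

```python
from bisect import bisect_left


def conservative_p_value(score, sorted_pool):
    """Rank of `score` among hard-negative pool scores (ties count against it).

    Assumption: the pool holds evaluator scores of known high-scoring failures.
    """
    n = len(sorted_pool)
    # Number of pool scores that are >= the deployment-time score.
    n_at_least = n - bisect_left(sorted_pool, score)
    return (1 + n_at_least) / (n + 1)


def p_to_e(p, kappa=0.5):
    """Power calibrator: maps a p-value to an e-value, for kappa in (0, 1)."""
    return kappa * p ** (kappa - 1.0)


def release_round(scores, pool, alpha=0.05, kappa=0.5):
    """Return the first round at which accumulated evidence reaches 1/alpha.

    Returns None if the threshold is never reached. `alpha` is the target
    bound on the probability of releasing on an infeasible task.
    """
    sorted_pool = sorted(pool)
    wealth = 1.0  # running product of e-values (the e-process)
    for t, score in enumerate(scores, start=1):
        wealth *= p_to_e(conservative_p_value(score, sorted_pool), kappa)
        if wealth >= 1.0 / alpha:
            return t  # release the current candidate
    return None  # keep revising, or abstain


# Hypothetical pool of 99 failure scores and two score streams.
pool = list(range(1, 100))
strong = release_round([100, 100, 100], pool)  # scores above every failure
weak = release_round([50] * 10, pool)          # scores typical of failures
```

Because the threshold is checked after every round, the stopping time is data-dependent. If the running product is a nonnegative supermartingale under the infeasible null, Ville's inequality bounds the probability of ever crossing 1/α by α, which is why no correction for repeated monitoring appears in the loop.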