When Should an AI Workflow Release? Always-Valid Inference for Black-Box Generate-Verify Systems

📅 2026-05-12

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses premature and invalid release in black-box generate-and-verify AI workflows, a problem that arises because evaluator scores are adaptively generated and repeatedly monitored. The authors propose a general-purpose release wrapper that requires neither likelihood models nor exchangeability assumptions. The wrapper builds a hard-negative reference pool from high-scoring failures, calibrates deployment-time scores against this pool to obtain conservative evidence, and accumulates that evidence with an e-process so that decisions remain valid under optional stopping. In theory, a conservative reference pool yields finite-sample control of the probability of releasing on infeasible tasks, and the same rule still achieves nontrivial release on feasible tasks under conditions the authors characterize. In an MBPP+ coding-agent case study, the wrapper reduces premature incorrect releases relative to baseline stopping rules while still releasing on tasks for which the workflow repeatedly accumulates moderate supporting evidence.

📝 Abstract

LLM-enabled AI workflows increasingly produce outputs through iterative generate-evaluate-revise loops. Each iteration can improve the candidate, but it also creates a release decision: when to stop and output the current result? This raises a statistical challenge because deployment-time evaluator scores are adaptively generated and repeatedly monitored, yet the likelihood models or exchangeability assumptions typically used for calibration are unavailable. We propose an always-valid release wrapper for existing generator-evaluator pipelines. The wrapper builds a hard-negative reference pool of high-scoring failures, calibrates deployment-time evaluator scores against this pool, and accumulates the resulting evidence with an e-process. This separates two roles: the reference pool turns black-box scores into conservative evidence, while the e-process provides validity under optional stopping. In theory, we show that a conservative reference pool yields finite-sample control of the probability of releasing on infeasible tasks, that is, tasks for which the given workflow is not capable of producing a reliable solution. We also characterize conditions under which the same conservative rule still achieves nontrivial release on feasible tasks. In an MBPP+ coding-agent case study, the wrapper reduces premature incorrect release relative to baseline stopping rules while still releasing on tasks for which the workflow repeatedly accumulates moderate supporting evidence.
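
The division of labor described in the abstract (reference pool for conservative evidence, e-process for optional-stopping validity) can be illustrated with a minimal Python sketch. This is not the authors' construction: the rank-based p-value, the power calibrator `kappa * p**(kappa - 1)`, the product-form accumulation, and every name below (`conservative_p_value`, `p_to_e`, `release_round`, `alpha`, `kappa`) are assumptions chosen for illustration. The sketch also assumes that, on an infeasible task, each per-round p-value is conservative given the past, so that the running product is a nonnegative supermartingale and Ville's inequality bounds the chance of ever crossing `1 / alpha` by `alpha`.

```python
from bisect import bisect_left
from typing import Iterable, Optional, Sequence


def conservative_p_value(score: float, sorted_pool: Sequence[float]) -> float:
    """Rank of `score` against a hard-negative pool sorted in ascending order.

    Hypothetical calibration step: the fraction of high-scoring failures that
    score at least as well as the current candidate, with a +1 correction.
    """
    n = len(sorted_pool)
    n_at_least = n - bisect_left(sorted_pool, score)  # pool scores >= score
    return (1 + n_at_least) / (n + 1)


def p_to_e(p: float, kappa: float = 0.5) -> float:
    """Assumed p-to-e calibrator: kappa * p**(kappa - 1), for 0 < kappa < 1.

    It integrates to 1 over p in (0, 1], so a conservative p-value maps to an
    e-value with expectation at most 1 under the null.
    """
    return kappa * p ** (kappa - 1.0)


def release_round(
    scores: Iterable[float],
    pool: Sequence[float],
    alpha: float = 0.1,
    kappa: float = 0.5,
) -> Optional[int]:
    """Return the first round (1-indexed) at which accumulated evidence
    reaches 1 / alpha, or None if the threshold is never reached."""
    sorted_pool = sorted(pool)
    wealth = 1.0  # running product of per-round e-values
    for t, score in enumerate(scores, start=1):
        wealth *= p_to_e(conservative_p_value(score, sorted_pool), kappa)
        if wealth >= 1.0 / alpha:
            return t  # release the current candidate
    return None  # keep iterating or abstain


if __name__ == "__main__":
    # Toy pool of 99 failure scores and two toy score streams.
    pool = [i / 100 for i in range(1, 100)]
    print(release_round([1.0] * 5, pool, alpha=0.1))
    print(release_round([0.0] * 5, pool, alpha=0.1))
```

In this toy setup, a candidate that repeatedly outscores the whole failure pool multiplies the evidence at each round, whereas a candidate that looks no better than the hard negatives shrinks it, so the rule can be checked after every iteration without inflating the error rate beyond `alpha` (under the stated assumption).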

Problem

Research questions and friction points this paper is trying to address.

AI workflow release

always-valid inference

generate-verify systems

optional stopping

black-box evaluation

Innovation

Methods, ideas, or system contributions that make the work stand out.

always-valid inference

e-process

black-box evaluation