🤖 AI Summary
This paper addresses the problem of reliably assessing action sequences (e.g., reasoning steps, tool calls) generated by intelligent agents. We propose the first online verification framework with provable error-rate control. Our method formulates trajectory success/failure discrimination as a sequential hypothesis test and leverages e-process theory to construct a model-agnostic statistical monitoring procedure. It transforms outputs from arbitrary black-box verifiers—such as LLM-based judges or process reward models—into real-time decision rules with a provable upper bound α on the false positive rate. The framework accommodates variable-length trajectories and enables early termination of anomalous sequences. We validate it across six benchmark datasets and three agent architectures, demonstrating significant improvements over baselines: it keeps the false positive rate at or below α while improving statistical power and computational efficiency.
📝 Abstract
Agentic AI systems execute a sequence of actions, such as reasoning steps or tool calls, in response to a user prompt. To evaluate the success of their trajectories, researchers have developed verifiers, such as LLM judges and process-reward models, to score the quality of each action in an agent's trajectory. Although these heuristic scores can be informative, they carry no guarantees of correctness when used to decide whether an agent will yield a successful output. Here, we introduce e-valuator, a method to convert any black-box verifier score into a decision rule with provable control of false alarm rates. We frame the problem of distinguishing between successful trajectories (that is, sequences of actions that will lead to a correct response to the user's prompt) and unsuccessful trajectories as a sequential hypothesis testing problem. E-valuator builds on tools from e-processes to develop a sequential hypothesis test that remains statistically valid at every step of an agent's trajectory, enabling online monitoring of agents over arbitrarily long sequences of actions. Empirically, we demonstrate that e-valuator provides greater statistical power and better false alarm rate control than other strategies across six datasets and three agents. We additionally show that e-valuator can be used to terminate problematic trajectories early and save tokens. Together, e-valuator provides a lightweight, model-agnostic framework that converts verifier heuristics into decision rules with statistical guarantees, enabling the deployment of more reliable agentic systems.
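To make the e-process idea concrete, here is a minimal sketch of a generic e-process monitor. This is an illustrative assumption, not the paper's actual construction: we assume each verifier score has already been converted into a nonnegative e-value whose conditional expectation is at most 1 under the null ("the trajectory is successful"). The running product of such e-values is then a nonnegative supermartingale, and Ville's inequality bounds the probability that it ever reaches 1/α by α, which is what makes the test valid at every step. The class name and e-values below are hypothetical.

```python
class EProcessMonitor:
    """Anytime-valid monitor: flag a trajectory once accumulated
    evidence against the null exceeds 1/alpha (Ville's inequality
    then bounds the false alarm rate by alpha)."""

    def __init__(self, alpha: float = 0.05):
        self.alpha = alpha
        self.wealth = 1.0  # running product of e-values

    def update(self, e_value: float) -> bool:
        """Fold in one per-action e-value; return True to flag (e.g., terminate)."""
        if e_value < 0.0:
            raise ValueError("e-values must be nonnegative")
        self.wealth *= e_value
        return self.wealth >= 1.0 / self.alpha


# Hypothetical stream of e-values from a black-box verifier;
# values > 1 accumulate evidence that the trajectory is failing.
monitor = EProcessMonitor(alpha=0.05)
for e in [2.0, 3.0, 4.0]:
    if monitor.update(e):
        print("flag: terminate trajectory early")
        break
```

Because the guarantee holds uniformly over time, the monitor can stop a trajectory at any action without inflating the false alarm rate, which is what enables the early-termination and token-saving behavior described above.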