Consistency as a Testable Property: Statistical Methods to Evaluate AI Agent Reliability

📅 2026-05-11

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing evaluation methods struggle to quantify the reliability of AI agents under semantics-preserving perturbations and do not separate an agent's core capability from its execution robustness. This work proposes a testable definition of consistency and an evaluation framework grounded in measurement science: it uses U-statistics to quantify output-level reliability and introduces a kernel-based metric for trajectory-level consistency, which is more sensitive to agent failure modes. Experiments on three agentic benchmarks show that the trajectory-level measure exposes strategy breakdowns more effectively than conventional pass@1 rates, supporting the deployment of agents in high-stakes scenarios.

📝 Abstract

This paper establishes a rigorous measurement science for AI agent reliability, providing a foundational framework for quantifying consistency under semantics-preserving perturbations. By leveraging $U$-statistics for output-level reliability and kernel-based metrics for trajectory-level stability, we offer a principled approach to evaluating agents across diverse operating conditions. Our proposal highlights the important distinction between the core capability and execution robustness of an agent, showing that minor task-level variations can induce complete strategy breakdowns despite the agent possessing the requisite knowledge for the task. We validate our framework through extensive experiments on three agentic benchmarks, demonstrating that trajectory-level consistency metrics provide far greater diagnostic sensitivity than traditional pass@1 rates. By providing the mathematical tools to isolate where and why agents deviate, we enable the identification and rectification of architectural concerns that hinder the deployment of agents in high-stakes, real-world environments.
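
To make the two levels of measurement concrete, here is a minimal Python sketch of how a pairwise consistency score could be computed. It is an illustration under stated assumptions, not the paper's estimator: the abstract does not specify the kernels, so `exact_match` (output level) and `action_bigram_jaccard` (trajectory level) are hypothetical choices, and the runs are made-up toy data. The only standard ingredient is the order-2 U-statistic itself, the mean of a symmetric kernel over all unordered pairs of samples.

```python
from itertools import combinations
from typing import Callable, Sequence


def u_statistic(samples: Sequence, kernel: Callable[[object, object], float]) -> float:
    """Order-2 U-statistic: mean of a symmetric kernel over all unordered pairs."""
    pairs = list(combinations(samples, 2))
    if not pairs:
        raise ValueError("need at least two samples")
    return sum(kernel(a, b) for a, b in pairs) / len(pairs)


def exact_match(a: str, b: str) -> float:
    """Hypothetical output-level kernel: 1.0 if two final answers agree, else 0.0."""
    return 1.0 if a == b else 0.0


def action_bigram_jaccard(t1: Sequence[str], t2: Sequence[str]) -> float:
    """Hypothetical trajectory-level kernel: Jaccard overlap of action bigrams.

    Start and end markers are added so that even an empty trajectory
    yields one bigram and the union is never empty.
    """
    def bigrams(traj: Sequence[str]) -> set:
        padded = ["<start>", *traj, "<end>"]
        return set(zip(padded, padded[1:]))

    b1, b2 = bigrams(t1), bigrams(t2)
    return len(b1 & b2) / len(b1 | b2)


# Toy data (invented for illustration): three runs of one task under
# semantics-preserving rephrasings of the prompt.
final_outputs = ["42", "42", "42"]
trajectories = [
    ["search", "open", "answer"],
    ["search", "open", "answer"],
    ["answer"],  # same final answer, but the agent skipped its usual strategy
]

output_consistency = u_statistic(final_outputs, exact_match)
trajectory_consistency = u_statistic(trajectories, action_bigram_jaccard)

print(round(output_consistency, 3))      # 1.0
print(round(trajectory_consistency, 3))  # 0.467
```

In this toy case all three runs return the same answer, so the output-level score is 1.0, while the trajectory-level score drops because one run follows a different action sequence. This is the kind of divergence between final-answer agreement and strategy stability that the abstract describes; any real evaluation would need kernels and perturbation sets chosen to match the benchmark.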

Problem

Research questions and friction points this paper is trying to address.

AI agent reliability

consistency

semantic perturbations

execution robustness

trajectory-level stability

Innovation

Methods, ideas, or system contributions that make the work stand out.

consistency

U-statistics

kernel-based metrics