AI Summary
Current evaluations of AI agents overemphasize task success rates while neglecting critical reliability dimensions such as consistency, robustness, predictability, and safety, leading to a disconnect between reported performance and real-world behavior. This work proposes the first comprehensive evaluation framework grounded in safety-critical engineering principles, systematically decomposing reliability into these four dimensions and defining twelve quantifiable metrics. Through multidimensional metric design, cross-model benchmarking, perturbation analysis, and failure mode assessment, we conduct an empirical study of fourteen mainstream agents across two complementary benchmarks. Our findings reveal that despite continuous improvements in model capabilities, reliability has increased only marginally, exposing significant vulnerabilities to operational fluctuations, environmental perturbations, and edge-case errors, thereby establishing new standards and empirical foundations for trustworthy AI deployment.
Abstract
AI agents are increasingly deployed to execute important tasks. While rising accuracy scores on standard benchmarks suggest rapid progress, many agents still fail in practice. This discrepancy highlights a fundamental limitation of current evaluations: compressing agent behavior into a single success metric obscures critical operational flaws. Notably, it ignores whether agents behave consistently across runs, withstand perturbations, fail predictably, or have bounded error severity. Grounded in safety-critical engineering, we provide a holistic performance profile by proposing twelve concrete metrics that decompose agent reliability along four key dimensions: consistency, robustness, predictability, and safety. Evaluating 14 agentic models across two complementary benchmarks, we find that recent capability gains have yielded only small improvements in reliability. By exposing these persistent limitations, our metrics complement traditional evaluations while offering tools for reasoning about how agents perform, degrade, and fail.
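The gap the abstract draws between a single success score and run-to-run consistency can be made concrete with a small sketch. The function names and the three-task example below are hypothetical illustrations, not the paper's actual twelve metrics: they contrast an aggregate success rate with a consistency-style measure over repeated runs of the same tasks.

```python
from statistics import mean

def mean_success(outcomes_per_task):
    """Conventional score: success rate averaged over all tasks and runs."""
    return mean(mean(1.0 if ok else 0.0 for ok in runs) for runs in outcomes_per_task)

def consistency(outcomes_per_task):
    """Fraction of tasks whose repeated runs all agree (all succeed or all fail)."""
    return mean(1.0 if len(set(runs)) == 1 else 0.0 for runs in outcomes_per_task)

def worst_case_success(outcomes_per_task):
    """Stricter score: a task counts only if every repeated run succeeds."""
    return mean(1.0 if all(runs) else 0.0 for runs in outcomes_per_task)

# Hypothetical agent: 3 tasks, 3 runs each (True = task succeeded on that run).
runs = [
    [True, True, True],     # stable success
    [True, False, True],    # flaky: the aggregate score hides this
    [False, False, False],  # stable failure
]

print(mean_success(runs))       # a flattering aggregate (5/9)
print(consistency(runs))        # only 2/3 of tasks behave consistently
print(worst_case_success(runs)) # only 1/3 succeed on every run
```

The flaky second task is invisible to the aggregate score but dominates the consistency and worst-case views, which is the kind of operational flaw a single success metric compresses away.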