Log analysis is necessary for credible evaluation of AI agents

📅 2026-05-08

🤖 AI Summary
Current evaluations of AI agents rely heavily on binary outcome metrics (pass/fail), which leaves them vulnerable to shortcuts, benchmark artifacts, and hazardous agent behaviors. The consequences are misjudged capability, inaccurate predictions of real-world utility, and overlooked safety risks. This work argues that systematic log analysis is necessary for credible evaluation, and it contributes a taxonomy of threats to credible evaluation documented through log analysis together with guiding principles for carrying it out. Combining qualitative and quantitative methods, the approach supports fine-grained diagnostics on benchmarks such as tau-Bench Airline. There, the analysis shows that pass^5 performance was under-elicited by nearly 50% and surfaces deployment failure modes that outcome metrics do not reveal. The paper concludes with pragmatic recommendations for benchmark creators, model developers, independent evaluators, and deployers to increase uptake of log analysis.
📝 Abstract
Agent benchmarks typically report only final outcomes: pass or fail. This threatens evaluation credibility in three ways. First, scores may be inflated or deflated by shortcuts and benchmark artifacts, misrepresenting capability. Second, benchmark performance may fail to predict real-world utility due to scaffold limitations and recurring failure modes. Finally, capability scores may conceal dangerous or catastrophic actions taken by the agent. We argue that log analysis -- the systematic tracking and analysis of the inputs, execution, and outputs of an AI agent -- is necessary to overcome these validity threats and promote credible agent evaluation. In this paper, we (1) present a taxonomy of threats to credible evaluation documented through log analysis, and (2) develop a set of guiding principles for log analysis. We illustrate these principles on tau-Bench Airline, revealing that pass^5 performance was under-elicited by nearly 50% and surfacing deployment failure modes invisible to outcome metrics. We conclude with pragmatic recommendations to increase uptake of log analysis, directed at diverse stakeholders including benchmark creators, model developers, independent evaluators, and deployers.
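For readers unfamiliar with the pass^5 figure cited above, the following is a minimal sketch of how a pass^k score can be estimated from per-trial outcomes. It assumes the definition commonly used with tau-bench: the probability that k independent trials of the same task all succeed, averaged over tasks, estimated per task as C(c, k) / C(n, k) for c successes in n trials. The function name `pass_hat_k` and the example data are hypothetical and are not taken from the paper.

```python
from math import comb


def pass_hat_k(trial_results: dict[str, list[bool]], k: int) -> float:
    """Estimate pass^k from per-task trial outcomes.

    For each task with n trials and c successes, the chance that k trials
    drawn without replacement all succeed is C(c, k) / C(n, k). The score
    is the mean of that quantity over tasks.
    """
    if not trial_results:
        raise ValueError("no tasks provided")
    per_task = []
    for task_id, outcomes in trial_results.items():
        n = len(outcomes)
        if n < k:
            raise ValueError(f"task {task_id!r} has {n} trials, need at least {k}")
        c = sum(outcomes)  # number of successful trials
        # math.comb(c, k) is 0 when k > c, so a single failure among
        # exactly k trials drives that task's contribution to 0.
        per_task.append(comb(c, k) / comb(n, k))
    return sum(per_task) / len(per_task)


# Hypothetical outcomes for two tasks, five trials each (illustrative only).
example = {
    "task_a": [True, True, True, True, True],
    "task_b": [True, True, True, True, False],
}
print(pass_hat_k(example, k=1))  # average single-trial success rate
print(pass_hat_k(example, k=5))  # requires all five trials to succeed
```

In this made-up example a single failed trial leaves the single-trial success rate high but removes that task's entire contribution at k = 5. A number like this cannot say whether the failed trial reflects the agent or the scaffold and benchmark setup; that distinction is what inspecting the logs is meant to provide.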
Problem

Research questions and friction points this paper is trying to address.

log analysis
AI agent evaluation
benchmark validity
failure modes
evaluation credibility
Innovation

Methods, ideas, or system contributions that make the work stand out.

log analysis
agent evaluation
benchmark validity
failure mode detection
AI safety