Can Agent Benchmarks Support Their Scores? Evidence-Supported Bounds for Interactive-Agent Evaluation

📅 2026-05-11
🤖 AI Summary
Current interactive-agent benchmarks often rely on surface-level signals to decide task success, so they cannot reliably verify that the intended goal state was actually reached, and the reported scores become misleading. This work proposes a general-purpose outcome evidence reporting layer that requires no modifications to tasks, agents, or evaluators. Before scoring, it specifies the stored artifacts needed to verify each claimed outcome. It then applies a locked checklist to each completed run and assigns one of three evidence labels: Evidence Pass, Evidence Fail, or Unknown. Finally, it reports evidence-supported score bounds that keep Unknown cases visible instead of folding them into a single binary success rate. Experiments on five public benchmarks (ANDROIDWORLD, AGENTDOJO, APPWORLD, tau3 bench retail, and MINIWOB) show that the resulting reports separate several empirically distinct failure modes.
📝 Abstract
Interactive-agent benchmarks map an agent run to a binary outcome through outcome checks. When these checks rely on surface-level signals or fail to capture the agent's actual action path, they cannot reliably determine whether the run succeeded. For example, a benchmark task may ask whether Alice's shipping address was changed, while the outcome check only verifies that the agent clicked "Save." This does not guarantee that the intended state change occurred, since the agent may have modified the wrong record. Treating such a run as successful therefore makes the reported score misleading. Benchmark quality thus depends not only on task design, but also on the reliability of outcome detection. We address this problem by introducing an outcome evidence reporting layer for existing benchmarks, without modifying their tasks, agents, or evaluators. The layer performs three functions. First, before scoring, it specifies which stored artifacts are required to verify the claimed outcome for each case. Second, it applies a locked checklist to each completed run and assigns one of three evidence labels: Evidence Pass, Evidence Fail, or Unknown. Third, it reports evidence-supported score bounds that quantify uncertainty arising from Unknown cases. Rather than silently counting, discarding, or hiding uncertain cases inside a single aggregate success rate, the framework keeps them explicitly visible. We evaluate the outcome evidence layer on five public benchmarks: ANDROIDWORLD, AGENTDOJO, APPWORLD, tau3 bench retail, and MINIWOB. The resulting reports separate several empirically distinct failure modes.
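
The three steps described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the names `EvidenceLabel`, `label_run`, and `score_bounds`, the artifact key `final_records`, and the sample addresses are all hypothetical. The abstract does not give the bound formula, so the sketch assumes one plausible reading: the lower bound counts every Unknown run as a failure and the upper bound counts every Unknown run as a success.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Iterable


class EvidenceLabel(Enum):
    PASS = "Evidence Pass"
    FAIL = "Evidence Fail"
    UNKNOWN = "Unknown"


@dataclass(frozen=True)
class ScoreBounds:
    lower: float    # success rate if every Unknown run actually failed
    upper: float    # success rate if every Unknown run actually succeeded
    n_unknown: int  # number of runs whose outcome could not be verified


def label_run(
    artifacts: dict,
    required: Iterable[str],
    checklist: Iterable[Callable[[dict], bool]],
) -> EvidenceLabel:
    """Apply a fixed checklist to the stored artifacts of one completed run."""
    # Assumption: a missing required artifact means the outcome cannot be
    # verified either way, so the run is labeled Unknown rather than Fail.
    if any(name not in artifacts for name in required):
        return EvidenceLabel.UNKNOWN
    if all(check(artifacts) for check in checklist):
        return EvidenceLabel.PASS
    return EvidenceLabel.FAIL


def score_bounds(labels: Iterable[EvidenceLabel]) -> ScoreBounds:
    """Assumed bound definition: Unknown runs widen the interval."""
    labels = list(labels)
    if not labels:
        raise ValueError("need at least one labeled run")
    n = len(labels)
    n_pass = sum(label is EvidenceLabel.PASS for label in labels)
    n_unknown = sum(label is EvidenceLabel.UNKNOWN for label in labels)
    return ScoreBounds(n_pass / n, (n_pass + n_unknown) / n, n_unknown)


# Hypothetical version of the shipping-address task from the abstract.
REQUIRED = ["final_records"]
CHECKLIST = [
    # The goal state is Alice's record holding the new address.
    lambda a: a["final_records"].get("alice") == "12 New St",
]

runs = [
    # Correct record updated.
    {"clicked_save": True, "final_records": {"alice": "12 New St", "bob": "5 Elm Rd"}},
    # "Save" was clicked, but the wrong record was modified.
    {"clicked_save": True, "final_records": {"alice": "1 Old St", "bob": "12 New St"}},
    # Only the click was logged; the final state was never stored.
    {"clicked_save": True},
]

labels = [label_run(run, REQUIRED, CHECKLIST) for run in runs]
bounds = score_bounds(labels)
print([label.value for label in labels])
print(bounds)
```

A check that only looked at `clicked_save` would score all three runs as successes. Under the assumed definition, the evidence-based report instead yields one pass, one fail, and one unverifiable run, and the width of the interval between `lower` and `upper` shows how much of the score rests on runs that could not be verified.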
Problem

Research questions and friction points this paper is trying to address.

interactive-agent evaluation
outcome verification
benchmark reliability
evidence-supported bounds
agent benchmarks
Innovation

Methods, ideas, or system contributions that make the work stand out.

outcome evidence
interactive agent evaluation
evidence-supported bounds
benchmark reliability
evaluation transparency