Can Agent Benchmarks Support Their Scores? Evidence-Supported Bounds for Interactive-Agent Evaluation

📅 2026-05-11

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Interactive-agent benchmarks often decide task success from surface-level signals, so they cannot reliably verify that the intended goal state was actually reached, and the reported scores are distorted as a result. This work proposes a general-purpose outcome evidence reporting layer that requires no modifications to tasks, agents, or evaluators. Before scoring, the layer specifies which stored artifacts are needed to verify each claimed outcome; it then applies a locked checklist to each completed run and assigns one of three evidence labels: Evidence Pass, Evidence Fail, or Unknown. By reporting evidence-supported score bounds, the approach keeps uncertain cases visible instead of folding them into a single binary success rate, which makes the evaluation more transparent. Experiments on five public benchmarks, including ANDROIDWORLD and AGENTDOJO, show that the resulting reports separate several empirically distinct failure modes and expose distortions that a single aggregate success rate hides.

📝 Abstract

Interactive-agent benchmarks map an agent run to a binary outcome through outcome checks. When these checks rely on surface-level signals or fail to capture the agent's actual action path, they cannot reliably determine whether the run succeeded. For example, a benchmark task may ask whether Alice's shipping address was changed, while the outcome check only verifies that the agent clicked "Save." This does not guarantee that the intended state change occurred, since the agent may have modified the wrong record. Treating such a run as successful therefore makes the reported score misleading. Benchmark quality thus depends not only on task design but also on the reliability of outcome detection. We address this problem by introducing an outcome evidence reporting layer for existing benchmarks, without modifying their tasks, agents, or evaluators. The layer performs three functions. First, before scoring, it specifies which stored artifacts are required to verify the claimed outcome for each case. Second, it applies a locked checklist to each completed run and assigns one of three evidence labels: Evidence Pass, Evidence Fail, or Unknown. Third, it reports evidence-supported score bounds that quantify the uncertainty arising from Unknown cases. Rather than silently counting, discarding, or hiding uncertain cases inside a single aggregate success rate, the framework keeps them explicitly visible. We evaluate the outcome evidence layer on five public benchmarks: ANDROIDWORLD, AGENTDOJO, APPWORLD, tau3 bench retail, and MINIWOB. The resulting reports separate several empirically distinct failure modes.
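
The three functions described in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation: the `Run` record, the `label_run` and `score_bounds` helpers, the artifact names, and the example address are all hypothetical. It also assumes one natural reading of the bounds, in which the lower bound counts only Evidence Pass runs and the upper bound additionally treats every Unknown run as a possible success; the abstract does not give the exact definition.

```python
from dataclasses import dataclass
from enum import Enum


class Label(Enum):
    EVIDENCE_PASS = "Evidence Pass"
    EVIDENCE_FAIL = "Evidence Fail"
    UNKNOWN = "Unknown"


@dataclass
class Run:
    """Hypothetical record of one completed agent run."""
    artifacts: dict   # stored artifacts, keyed by name
    required: tuple   # artifact names fixed before scoring
    checks: tuple     # locked checklist: callables (artifacts) -> bool


def label_run(run):
    # If a required artifact was not stored, the outcome cannot be verified.
    if any(name not in run.artifacts for name in run.required):
        return Label.UNKNOWN
    # Every checklist item must hold for the evidence to support success.
    if all(check(run.artifacts) for check in run.checks):
        return Label.EVIDENCE_PASS
    return Label.EVIDENCE_FAIL


def score_bounds(labels):
    # Assumed definition: lower = passes / n, upper = (passes + unknowns) / n.
    n = len(labels)
    if n == 0:
        raise ValueError("no runs to score")
    passed = sum(label is Label.EVIDENCE_PASS for label in labels)
    unknown = sum(label is Label.UNKNOWN for label in labels)
    return passed / n, (passed + unknown) / n


def address_changed(artifacts):
    # Checks the stored record itself, not whether "Save" was clicked.
    return artifacts["alice_record"]["shipping_address"] == "12 New St"


required = ("alice_record",)
checks = (address_changed,)
runs = [
    Run({"alice_record": {"shipping_address": "12 New St"}}, required, checks),
    Run({"alice_record": {"shipping_address": "1 Old Rd"}}, required, checks),
    Run({"click_log": ["Save"]}, required, checks),  # only a surface signal stored
    Run({"alice_record": {"shipping_address": "12 New St"}}, required, checks),
]

labels = [label_run(run) for run in runs]
low, high = score_bounds(labels)
print([label.value for label in labels])
print(f"evidence-supported score in [{low:.2f}, {high:.2f}]")
```

In this toy set, the third run stored only a click log, so it is labeled Unknown and widens the interval to [0.50, 0.75] instead of being counted as a success or silently dropped.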

Problem

Research questions and friction points this paper is trying to address.

interactive-agent evaluation

outcome verification

benchmark reliability

evidence-supported bounds

agent benchmarks

Innovation

Methods, ideas, or system contributions that make the work stand out.

outcome evidence

interactive-agent evaluation

evidence-supported bounds