An Executable Benchmarking Suite for Tool-Using Agents

📅 2026-05-10

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Evaluations of tool-using agents often conflate workload specifications, action-generating drivers, and the evidence admitted for claims, with no unified, auditable framework. This work presents an executable benchmarking suite built around an evidence-admission gate, which explicitly separates workloads, drivers, and admitted evidence under a shared evidence-admission contract. The suite connects WebArena Verified, a SWE-Gym slice with SWE-bench-compatible verification, and MiniWoB++ through common workload adapters, declared drivers, task manifests, event schemas, a replay/freeze policy, and reporting pipelines. Admitted evidence records latency, invalid-action behavior, patch-generation cost, verifier metadata, replay bindings, and provenance. In a separate WebArena Verified controller study, clean-baseline and medium live-stressed evaluation select different fixed controller variants under the same workload and admission contract, so the gate affects decisions rather than serving only as bookkeeping.

📝 Abstract

Closed-loop tool-using agents are increasingly evaluated in executable web, code, and micro-task environments, but benchmark reports often conflate workloads, action-generating drivers, and the evidence admitted for systems-facing claims. We present an executable benchmarking suite that makes these objects explicit under a shared evidence-admission contract. The suite connects WebArena Verified, a SWE-Gym slice with SWE-bench-compatible verification, and MiniWoB++ through common workload adapters, task manifests, event schemas, replay/freeze policy, declared drivers, and reporting pipelines. In the canonical release, the gate separates paper-facing evidence from preflight, fixture, smoke, and diagnostic rows while preserving non-admitted artifacts for audit and onboarding. The admitted evidence records latency, invalid-action behavior, patch-generation cost, verifier metadata, replay bindings, and provenance under one auditable contract. The gate is decision-relevant rather than merely clerical: in a separate WebArena Verified controller study, clean-baseline and medium live-stressed evaluation select different fixed controller variants under the same workload and admission contract. The release is scoped as a benchmarking suite and admitted evidence, not a new agent policy, model leaderboard, backend comparison, or autonomous SWE-bench solver.
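
As a rough illustration of the gating idea in the abstract, the following is a minimal sketch of how an admission gate might split run rows into paper-facing evidence and retained audit artifacts. Everything in it is an assumption made for illustration: the `RunRow` fields, the `RowKind` labels, the `admit` function, the driver name, and the rule that an admitted row needs both a replay binding and verifier metadata. The suite's actual contract and schema are not specified here.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional, Tuple


class RowKind(Enum):
    # Hypothetical labels mirroring the row types named in the abstract.
    EVALUATION = "evaluation"
    PREFLIGHT = "preflight"
    FIXTURE = "fixture"
    SMOKE = "smoke"
    DIAGNOSTIC = "diagnostic"


@dataclass(frozen=True)
class RunRow:
    # Hypothetical schema; field names are illustrative only.
    task_id: str
    workload: str
    driver: str
    kind: RowKind
    replay_binding: Optional[str] = None
    verifier: Optional[str] = None


def admit(rows: List[RunRow]) -> Tuple[List[RunRow], List[RunRow]]:
    """Split rows into (admitted, retained).

    Assumed rule: only evaluation rows that carry both a replay binding
    and verifier metadata are admitted. All other rows are kept for
    audit and onboarding rather than discarded.
    """
    admitted: List[RunRow] = []
    retained: List[RunRow] = []
    for row in rows:
        is_admissible = (
            row.kind is RowKind.EVALUATION
            and row.replay_binding is not None
            and row.verifier is not None
        )
        (admitted if is_admissible else retained).append(row)
    return admitted, retained


rows = [
    RunRow("t1", "webarena-verified", "scripted-driver", RowKind.EVALUATION,
           replay_binding="replay-001", verifier="verifier-a"),
    RunRow("t2", "miniwob++", "scripted-driver", RowKind.SMOKE),
    # An evaluation row without a replay binding is retained, not admitted.
    RunRow("t3", "swe-gym-slice", "scripted-driver", RowKind.EVALUATION,
           verifier="verifier-b"),
]
admitted, retained = admit(rows)
```

Under these assumptions, only `t1` would count as admitted evidence, while `t2` and `t3` stay available as non-admitted artifacts. Keeping the retained list instead of dropping it reflects the abstract's point that non-admitted rows are preserved for audit.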

Problem

Research questions and friction points this paper is trying to address.

tool-using agents

benchmarking

evidence admission

executable environments

evaluation methodology

Innovation

Methods, ideas, or system contributions that make the work stand out.

executable benchmarking

tool-using agents

evidence-admission contract