On Randomness in Agentic Evals

📅 2026-02-06
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study addresses the significant randomness inherent in single-run evaluation (pass@1) for agent benchmarking, which often leads to misleading conclusions about performance improvements. By collecting 60,000 agent trajectories, the authors systematically analyze evaluation variance and show that even at temperature zero, early trajectory divergence induces substantial performance fluctuations. To mitigate this issue, the work proposes multi-run evaluation protocols, statistical power analysis, and extended metrics such as pass@k and pass^k, complemented by token-level trajectory analysis. The study quantifies how evaluation noise affects judgments of algorithmic progress, demonstrating that single-run pass@1 estimates vary by 2.2–6.0 percentage points depending on which run is selected, with standard deviations exceeding 1.5 percentage points—indicating that small reported gains frequently stem from evaluation noise rather than genuine advancements.
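The pass@k and pass^k metrics mentioned above can be sketched as follows. The pass@k estimator is the standard unbiased combinatorial form (popularized in code-generation evaluation); the pass^k estimator shown is an analogous without-replacement estimate of "all k runs pass". Function names and the example numbers are illustrative, not from the paper:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Optimistic bound: probability that at least one of k samples,
    drawn without replacement from n runs of which c passed, passes."""
    if n - c < k:
        return 1.0  # every size-k subset must contain a passing run
    return 1.0 - comb(n - c, k) / comb(n, k)

def pass_hat_k(n: int, c: int, k: int) -> float:
    """Pessimistic bound: probability that all k sampled runs pass."""
    if c < k:
        return 0.0  # not enough passing runs to fill a size-k subset
    return comb(c, k) / comb(n, k)

# Hypothetical task: 10 independent runs, 4 of which passed.
print(pass_at_k(10, 4, 3))   # ≈ 0.833 (at least one of 3 passes)
print(pass_hat_k(10, 4, 3))  # ≈ 0.033 (all 3 pass)
```

With k=1 both estimators reduce to the multi-run pass@1 mean c/n, which is why the paper's recommended multi-run protocol and these k>1 bounds fit together as one performance envelope.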

📝 Abstract
Agentic systems are evaluated on benchmarks where agents interact with environments to solve tasks. Most papers report a pass@1 score computed from a single run per task, assuming this gives a reliable performance estimate. We test this assumption by collecting 60,000 agentic trajectories on SWE-Bench-Verified, spanning three models and two scaffolds. We find substantial variance: single-run pass@1 estimates vary by 2.2 to 6.0 percentage points depending on which run is selected, with standard deviations exceeding 1.5 percentage points even at temperature 0. This variance has critical implications: reported improvements of 2–3 percentage points may reflect evaluation noise rather than genuine algorithmic progress. Through token-level analysis, we show that trajectories diverge early, often within the first few percent of tokens, and that these small differences cascade into different solution strategies. To enable reliable evaluation of agentic systems, we recommend three concrete practices: (1) estimate pass@1 from multiple independent runs per task, especially when measuring small improvements, (2) use statistical power analysis to determine the number of runs needed to detect expected effect sizes, and (3) consider metrics like pass@k (optimistic bound) and pass^k (pessimistic bound) with k>1 to better characterize the full performance envelope. While these practices increase evaluation cost, they are essential for distinguishing genuine scientific progress from statistical noise.
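The power-analysis recommendation in the abstract can be made concrete with a standard two-proportion z-test sample-size formula (normal approximation). This is a generic statistical sketch, not the paper's own procedure; the function name and example pass rates are illustrative:

```python
from math import ceil
from statistics import NormalDist

def runs_needed(p1: float, p2: float,
                alpha: float = 0.05, power: float = 0.80) -> int:
    """Approximate runs per condition to detect a difference between
    true pass rates p1 and p2 with a two-sided two-proportion z-test."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value
    z_beta = NormalDist().inv_cdf(power)           # power quantile
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_beta) ** 2 * variance / (p1 - p2) ** 2)

# A hypothetical 2-point gain (50% -> 52%) requires thousands of runs
# per condition, while a 10-point gain needs far fewer.
print(runs_needed(0.50, 0.52))
print(runs_needed(0.50, 0.60))
```

The quadratic dependence on the effect size is the key point: halving the improvement you hope to detect roughly quadruples the number of runs required, which is why 2–3 point gains measured from single runs are so easily noise.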
Problem

Research questions and friction points this paper is trying to address.

agentic evaluation
randomness
pass@1 variance
evaluation reliability
statistical noise
Innovation

Methods, ideas, or system contributions that make the work stand out.

agentic evaluation
evaluation variance
pass@1 reliability
statistical power analysis
trajectory divergence