🤖 AI Summary
This work addresses unreliable evaluation and poor reproducibility in agent research, which often stem from rollout records that are missing or hard to inspect. The authors propose rollout cards: publication bundles that preserve the rollout record and declare the views, reporting rules, and drops manifests (data exclusions) behind reported scores. A reference implementation is integrated into Ergon, an open-source reinforcement learning gym. Re-analysis of four partial public releases (tool safety, multi-agent systems, theorem proving, and search) yields analyses the original reports did not include. Re-grading preserved benchmark outputs across short-answer, code-generation, and tool-use tasks shows that changing only the reporting rule can shift reported scores by 20.9 absolute percentage points and, in some cases, invert rankings of frontier models, underscoring the role of transparent, standardized rollout reporting in result credibility.
📝 Abstract
Reproducibility problems that have long affected machine learning and reinforcement learning are now surfacing in agent research: papers compare systems by reported scores while leaving the rollout records behind those scores difficult to inspect. For agentic tasks, this matters because the same behaviour can receive different reported scores when evaluations select different parts of a rollout or apply different reporting rules. In a structured audit of 50 popular training and evaluation repositories, we find that none report how many runs failed, errored, or were skipped alongside headline scores. We also document 37 cases where reporting rules can change task-success rates, cost/token accounting, or timing measurements for fixed evidence, sometimes dramatically. We treat rollout records, not reported scores, as the unit of reproducibility for agent research. We introduce rollout cards: publication bundles that preserve the rollout record and declare the views, reporting rules, and drops manifests behind reported scores. We validate rollout cards in two settings. First, four partial public releases in tool safety, multi-agent systems, theorem proving, and search let us compute analyses their original reports did not include. Second, re-grading preserved benchmark outputs across short-answer, code-generation, and tool-use tasks shows that changing only the reporting rule can change reported scores by 20.9 absolute percentage points and, in some cases, invert rankings of frontier models. We release a reference implementation integrated into Ergon, an open-source reinforcement learning gym, and publish Ergon-produced rollout-card exports for benchmarks spanning tool use, software engineering, web interaction, multi-agent coordination, safety, and search to support future research.
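The claim that a reporting rule alone can move scores and invert rankings can be illustrated with a minimal sketch. Everything here is hypothetical: the `Rollout` record, the status labels, the model names, and the counts are invented for illustration and are not the paper's schema, Ergon's API, or any result from the paper. The sketch scores the same fixed records under two rules, one counting errored runs as failures and one dropping them, and reports the number of dropped runs next to each score.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rollout:
    model: str
    status: str  # hypothetical labels: "success", "failure", or "error"


def success_rate(rollouts, count_errors_as_failures):
    """Fraction of successful runs under one reporting rule."""
    if count_errors_as_failures:
        scored = list(rollouts)  # strict rule: every run is in the denominator
    else:
        # lenient rule: errored runs are dropped before scoring
        scored = [r for r in rollouts if r.status != "error"]
    if not scored:
        raise ValueError("no rollouts left to score")
    return sum(r.status == "success" for r in scored) / len(scored)


def make_records(model, successes, failures, errors):
    """Build a fixed list of rollout records for one model."""
    return (
        [Rollout(model, "success")] * successes
        + [Rollout(model, "failure")] * failures
        + [Rollout(model, "error")] * errors
    )


# Invented evidence: 10 runs per model, identical under both rules.
RECORDS = {
    "model_a": make_records("model_a", successes=6, failures=4, errors=0),
    "model_b": make_records("model_b", successes=5, failures=1, errors=4),
}


def report(records):
    """Score every model under both rules and count the dropped runs."""
    return {
        name: {
            "strict": success_rate(runs, count_errors_as_failures=True),
            "lenient": success_rate(runs, count_errors_as_failures=False),
            "dropped": sum(r.status == "error" for r in runs),
        }
        for name, runs in records.items()
    }


if __name__ == "__main__":
    for name, row in report(RECORDS).items():
        print(
            f"{name}: strict={row['strict']:.3f} "
            f"lenient={row['lenient']:.3f} dropped={row['dropped']}"
        )
```

With these made-up counts, `model_a` ranks first under the strict rule and `model_b` ranks first under the lenient one, even though no rollout changed. Publishing the dropped-run count alongside each score is what lets a reader recompute either number from the same records.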