Do Androids Dream of Breaking the Game? Systematically Auditing AI Agent Benchmarks with BenchJack

📅 2026-05-12

📈 Citations: 0

✨ Influential: 0

career value

193K/year

🤖 AI Summary

Current AI agent benchmarks are vulnerable to reward hacking, where agents achieve high scores through unintended behaviors without genuinely completing tasks. To address this, this work proposes BenchJack, the first systematic and automated red-teaming auditing and hardening framework. BenchJack leverages eight empirically derived benchmark vulnerability patterns, integrating code-generating agents with clairvoyant vulnerability discovery techniques to proactively identify and exploit weaknesses. It further introduces a generative adversarial iterative pipeline to repair vulnerabilities and strengthen benchmarks. Evaluated across ten mainstream benchmarks, BenchJack uncovered 219 vulnerabilities, initially achieving near-100% exploitability. After three rounds of iterative hardening, the proportion of exploitable tasks dropped below 10%, with WebArena and OSWorld fully secured.

📝 Abstract

Agent benchmarks have become the de facto measure of frontier AI competence, guiding model selection, investment, and deployment. However, reward hacking, where agents maximize a score without performing the intended task, emerges spontaneously in frontier models without overfitting. We argue that benchmarks must be secure by design. From past incidents of reward hacks, we derive a taxonomy of eight recurring flaw patterns and compile them into the Agent-Eval Checklist for benchmark designers. We condense the insights into BenchJack, an automated red-teaming system that drives coding agents to audit benchmarks and identify possible reward-hacking exploits in a clairvoyant manner. Moreover, we extend BenchJack to an iterative generative-adversarial pipeline that discovers new flaws and patches them iteratively to improve benchmark robustness. We apply BenchJack to 10 popular agent benchmarks spanning software engineering, web navigation, desktop computing, and terminal operations. BenchJack synthesizes reward-hacking exploits that achieve near-perfect scores on most of the benchmarks without solving a single task, surfacing 219 distinct flaws across the eight classes. Moreover, BenchJack's extended pipeline reduces the hackable-task ratio from near 100% to under 10% on four benchmarks without fatal design flaws, fully patching WebArena and OSWorld within three iterations. Our results show that evaluation pipelines have not internalized an adversarial mindset, and that proactive auditing could help close the security gap for the fast-paced benchmarking space.

Problem

Research questions and friction points this paper is trying to address.

reward hacking

AI agent benchmarks

benchmark security

evaluation robustness

adversarial auditing

Innovation

Methods, ideas, or system contributions that make the work stand out.

reward hacking

benchmark auditing

red-teaming