🤖 AI Summary
Existing approaches struggle to scalably and reliably evaluate whether agents engage in reward hacking during alignment with human intent. This work proposes a novel evaluation paradigm that enables deterministic detection and automated assessment of such behavior by verifiably embedding reward vulnerabilities into the environment, overcoming the limitations of prior methods that rely on post-hoc trajectory analysis. Building upon TextArena, we introduce Hack-Verifiable TextArena—a testbed that integrates verifiable environment design with language model behavioral analysis—to achieve, for the first time, systematic and reproducible evaluation of reward hacking. The platform is open-sourced and supports benchmarking diverse language models across a range of tasks.
📝 Abstract
Aligning autonomous agents with human intent remains a central challenge in modern AI. A key manifestation of this challenge is reward hacking, whereby agents appear successful under the evaluation signal while violating the intended objective. Reward hacking has been observed across a wide range of settings, yet methods for reliably measuring it at scale remain lacking. In this work, we introduce a new evaluation paradigm for measuring reward hacking. Whereas prior studies have primarily analyzed it post hoc by inspecting agent trajectories, we instead embed detectable reward hacking opportunities directly into environments. This makes their exploitation verifiable by design, enabling deterministic and automated measurement of whether and how agents exploit such vulnerabilities. We instantiate this approach in $\textit{TextArena}$ and release $\textit{Hack-Verifiable TextArena}$, a testbed in which reward hacking can be measured reliably. Using this benchmark, we analyze reward hacking behavior across language models in diverse environments and settings. We open source the code at https://github.com/MajoRoth/hack-verifiable-environments/.