🤖 AI Summary
This work addresses a failure mode in reinforcement learning with world models: an imperfect world model can induce erroneous policy preferences, ranking one policy strictly above another when the true environment ranks them the other way. The paper formalizes this as model exploitation, relates it to a prior characterization of reward hacking, and proves that exploitation is essentially unavoidable on large policy sets, with the corresponding result for reward hacking following as a special case. It also shows that the conditions guaranteeing unhackability on finite policy sets have no counterpart that rules out exploitation. In response, the authors introduce a relaxed notion of exploitation and derive a safe horizon within which it can be avoided, clarifying the limits of safe planning in world models.
📝 Abstract
We propose a novel definition of model exploitation in reinforcement learning. Informally, a world model is exploitable if it implies that one policy should be strictly preferred over another while the environment's true transition model implies the reverse. We analogize our definition with a prior characterization of reward hacking but show that the associated proof of inevitability does not transfer to exploitation. To overcome this obstruction, we develop a general theory of reward hacking and model exploitation that proves that exploitation is essentially unavoidable on large policy sets and yields the corresponding claim for hacking as a special case. Unfortunately, we also find that the conditions that guarantee unhackability in finite policy sets have no counterpart that precludes exploitation. Consequently, we introduce a relaxed notion of exploitation and derive a safe horizon within which it can be avoided. Taken together, our results establish a formal bridge between reward hacking and model exploitation and elucidate the limits of safe planning in world models.
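The informal definition above can be illustrated on a toy problem. The following is a minimal sketch, not the paper's formalism: it assumes a tabular MDP, deterministic policies, and discounted return from a fixed start state as the preference criterion. The names `policy_value` and `exploited_pairs`, the two-state environment, and all numbers are hypothetical and chosen only for illustration.

```python
from itertools import product


def policy_value(P, R, policy, gamma=0.9, start=0, sweeps=2000):
    """Discounted return of a deterministic policy from `start`.

    P[s][a] is a list of next-state probabilities, R[s][a] is a scalar
    reward, and policy[s] is the action taken in state s.
    """
    n = len(P)
    v = [0.0] * n
    for _ in range(sweeps):  # iterative policy evaluation
        v = [
            R[s][policy[s]]
            + gamma * sum(p * v[t] for t, p in enumerate(P[s][policy[s]]))
            for s in range(n)
        ]
    return v[start]


def exploited_pairs(P_true, P_model, R, policies, tol=1e-9):
    """Pairs (pi, pi2) that the model ranks pi > pi2 while the true
    dynamics rank pi < pi2 (assumed reading of the informal definition)."""
    pairs = []
    for pi, pi2 in product(policies, repeat=2):
        gap_model = policy_value(P_model, R, pi) - policy_value(P_model, R, pi2)
        gap_true = policy_value(P_true, R, pi) - policy_value(P_true, R, pi2)
        if gap_model > tol and gap_true < -tol:
            pairs.append((pi, pi2))
    return pairs


# Hypothetical environment: state 0 is the start, state 1 is an absorbing
# zero-reward state. Action 0 is "safe", action 1 is "risky".
R = [[1.0, 2.0], [0.0, 0.0]]
P_true = [
    [[1.0, 0.0], [0.1, 0.9]],  # risky usually falls into state 1
    [[0.0, 1.0], [0.0, 1.0]],
]
P_model = [
    [[1.0, 0.0], [1.0, 0.0]],  # the model wrongly believes risky is safe
    [[0.0, 1.0], [0.0, 1.0]],
]

safe, risky = (0, 0), (1, 0)
print(exploited_pairs(P_true, P_model, R, [safe, risky]))
# prints [((1, 0), (0, 0))]
```

Under these assumptions the model values the risky policy at about 20 and the safe one at about 10, while the true dynamics value them at about 2.2 and 10, so the model's strict preference is reversed in the environment. A planner that maximizes return under `P_model` would select exactly the policy that the true environment ranks lower.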