🤖 AI Summary
Deep reinforcement learning (DRL) agents exhibit severe zero-shot generalization failures under task simplification, suffering over 70% average performance degradation. This reveals a critical overreliance on environmental shortcuts and a fundamental gap from human-like robust adaptability. The paper presents the first systematic evaluation of such generalization failures on simplified tasks and introduces HackAtari, a benchmark built on the Arcade Learning Environment (ALE) that enables controllable, dynamic, and scalable assessment of systematic generalization. Methodologically, the authors compare multiple algorithms (e.g., DQN, PPO) and neural architectures, demonstrating that standard training paradigms fail to induce human-like adaptivity. The core contributions are threefold: (1) identifying task simplification as a pivotal generalization stressor; (2) establishing a dynamic evaluation framework for systematic generalization; and (3) providing a reproducible benchmark and diagnostic toolkit to advance DRL toward human-level adaptive intelligence.
📝 Abstract
Deep reinforcement learning (RL) agents achieve impressive results on a wide variety of tasks, but they lack zero-shot adaptation capabilities. While most robustness evaluations focus on task complexifications, under which humans also struggle to maintain performance, no evaluation has been performed on task simplifications. To tackle this issue, we introduce HackAtari, a set of task variations of the Arcade Learning Environment. We use it to demonstrate that, contrary to humans, RL agents systematically exhibit large performance drops on simpler versions of their training tasks, uncovering agents' consistent reliance on shortcuts. Our analysis across multiple algorithms and architectures highlights the persistent gap between RL agents and human behavioral intelligence, underscoring the need for new benchmarks and methodologies that enforce systematic generalization testing beyond static evaluation protocols. Training and testing in the same environment is not enough to obtain agents equipped with human-like intelligence.
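The evaluation described above can be sketched as a simple relative-drop metric: roll out a trained agent on its original training task and on a simplified variant, then compare mean episode returns. This is a minimal illustrative sketch, not the paper's exact protocol; the function name and the example returns are hypothetical.

```python
import statistics

def relative_drop(original_returns, simplified_returns):
    """Relative performance drop (%) when moving from the training task
    to a simplified variant. A positive value means the agent performs
    worse on the *easier* task, hinting at shortcut reliance."""
    base = statistics.mean(original_returns)
    simplified = statistics.mean(simplified_returns)
    return 100.0 * (base - simplified) / abs(base)

# Hypothetical episode returns for an agent on a game and a simplified
# variant (e.g., a weakened opponent); numbers are illustrative only.
original = [18.0, 20.0, 19.0]
simplified = [4.0, 6.0, 5.0]
print(f"{relative_drop(original, simplified):.1f}% drop")  # → 73.7% drop
```

In practice the simplified variants would come from the HackAtari task modifications, and the returns from rolling out a fixed (frozen) policy; a drop far above zero on an objectively easier task is the failure mode the paper highlights.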