🤖 AI Summary
This work addresses the limitations of conventional pixel-based deep reinforcement learning, which often relies on low-resolution inputs and struggles to exploit fine-grained visual details. The authors systematically investigate the impact of visual resolution on policy learning and advocate Impoola, a resolution-agnostic architecture that replaces the flattening operation in Impala with global average pooling, thereby decoupling the parameter count from input resolution. Using Procgen-HD, a high-resolution benchmark they introduce, together with gradient saliency analysis, they show that policies trained at higher resolutions attend to more spatially localized regions. Impoola achieves a 28% performance improvement over Impala when each is evaluated at its best configuration, with the largest gains in tasks requiring the recognition of small or distant objects.
📝 Abstract
Pixel-based deep reinforcement learning agents are typically trained on heavily downsampled visual observations, a convention inherited from early benchmarks rather than grounded in principled design. In this work, we show that observation resolution is a critical yet overlooked variable for policy learning: higher-resolution inputs can substantially improve both performance and generalization, provided the network architecture can process them effectively. We find that the widely used Impala encoder, which flattens spatial features into a vector, suffers from parameter growth that is quadratic in the input side length and fails to leverage the additional visual detail. Replacing this operation with global average pooling, as in the Impoola architecture, decouples parameter count from resolution and yields consistent improvements across resolutions and network widths. At their respective best conditions, visual scaling unlocks a 28% performance gain for Impoola over Impala. These gains are strongest in environments that require precise perception of small or distant objects, and gradient saliency analysis confirms that the underlying mechanism is more spatially localized visual attention by the policy at higher resolutions. Our results challenge the prevailing practice of aggressive input downsampling and position resolution-independent architectures as a simple, effective path toward scalable visual deep RL. To facilitate future research on resolution scaling in deep RL, we publicly release the open-source code for the Procgen-HD benchmark: https://github.com/raphajaner/procgen-hd.
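
The parameter-growth argument can be illustrated with a short back-of-the-envelope calculation. The sketch below counts only the parameters of the first dense layer after the convolutional encoder, comparing a flatten head with a global-average-pooling head. The concrete numbers are assumptions chosen to resemble a commonly used Impala-style configuration (three stride-2 downsampling stages, 32 output channels, a 256-unit dense layer); they are not taken from the abstract, and the function names are hypothetical.

```python
# Illustrative parameter count for the layer that follows the conv encoder.
# ASSUMPTIONS (not from the abstract): 3 stride-2 downsampling stages,
# 32 final channels, 256 hidden units in the dense layer.


def conv_output_side(input_side: int, num_stages: int = 3) -> int:
    """Side length of the feature map after `num_stages` stride-2 poolings.

    Each stage is assumed to use padded pooling, so the side length is
    halved with rounding up (ceil division).
    """
    side = input_side
    for _ in range(num_stages):
        side = (side + 1) // 2
    return side


def head_params_flatten(input_side: int, channels: int = 32, hidden: int = 256) -> int:
    """Dense-layer parameters when the feature map is flattened (Impala-style)."""
    side = conv_output_side(input_side)
    flat_features = channels * side * side  # grows with side ** 2
    return flat_features * hidden + hidden  # weights + biases


def head_params_gap(input_side: int, channels: int = 32, hidden: int = 256) -> int:
    """Dense-layer parameters after global average pooling (Impoola-style).

    Pooling collapses each channel to a single value, so the input
    resolution does not enter the count at all.
    """
    return channels * hidden + hidden  # weights + biases


if __name__ == "__main__":
    for side in (64, 128, 256):
        print(side, head_params_flatten(side), head_params_gap(side))
```

Under these assumptions, a 64×64 input gives the flatten head 524,544 parameters against 8,448 for the pooled head, and each doubling of the side length multiplies the flatten head's weight count by four while the pooled head stays fixed.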