🤖 AI Summary
Standard Prioritized Experience Replay (PER) does not distinguish experiences by how reliably their priorities reflect learning potential, resulting in suboptimal sampling efficiency. To address this, the paper proposes Reliability-adjusted Prioritized Experience Replay (ReaPER), which introduces a temporal-difference (TD) error reliability measure, dynamically adjusting sample priorities based on the stability of TD errors so as to preferentially select high-information, low-noise transitions. Theoretically, ReaPER achieves superior sample efficiency compared to standard PER under mild assumptions. Empirically, ReaPER demonstrates faster convergence and improved final performance across benchmark domains, including Atari-5, without requiring auxiliary networks or additional hyperparameter tuning.
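To make the idea concrete, here is a minimal sketch of a replay buffer that weights PER-style priorities by a TD-error reliability term. The exact ReaPER formula is not reproduced here; the inverse-variance reliability weight, the `history` window, and all class and parameter names are illustrative assumptions, not the paper's implementation.

```python
import random
from collections import deque

class ReliabilityAdjustedBuffer:
    """Toy replay buffer: priority = |TD error|^alpha scaled by a
    reliability weight derived from the variance of each transition's
    recent TD errors (illustrative assumption, not the ReaPER formula)."""

    def __init__(self, capacity=10000, history=5, alpha=0.6, eps=1e-6):
        self.capacity, self.history = capacity, history
        self.alpha, self.eps = alpha, eps
        self.transitions = []   # stored (s, a, r, s_next, done) tuples
        self.td_histories = []  # recent TD errors per transition

    def add(self, transition, td_error):
        if len(self.transitions) >= self.capacity:
            self.transitions.pop(0)
            self.td_histories.pop(0)
        self.transitions.append(transition)
        self.td_histories.append(deque([td_error], maxlen=self.history))

    def _priority(self, i):
        hist = self.td_histories[i]
        mean = sum(hist) / len(hist)
        var = sum((d - mean) ** 2 for d in hist) / len(hist)
        reliability = 1.0 / (1.0 + var)  # stable TD errors -> weight near 1
        return (abs(hist[-1]) + self.eps) ** self.alpha * reliability

    def sample(self, k):
        pri = [self._priority(i) for i in range(len(self.transitions))]
        total = sum(pri)
        return random.choices(range(len(self.transitions)),
                              weights=[p / total for p in pri], k=k)

    def update(self, indices, td_errors):
        # Refresh TD-error histories after a learning step.
        for i, d in zip(indices, td_errors):
            self.td_histories[i].append(d)
```

Transitions whose TD errors fluctuate heavily (high variance, i.e. noisy learning signal) receive a reliability weight below 1 and are sampled less often than transitions with equally large but stable TD errors.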
📝 Abstract
Experience replay enables data-efficient learning from past experiences in online reinforcement learning agents. Traditionally, experiences were sampled uniformly from a replay buffer, regardless of differences in experience-specific learning potential. In an effort to sample more efficiently, researchers introduced Prioritized Experience Replay (PER). In this paper, we propose an extension to PER by introducing a novel measure of temporal difference error reliability. We theoretically show that the resulting transition selection algorithm, Reliability-adjusted Prioritized Experience Replay (ReaPER), enables more efficient learning than PER. We further present empirical results showing that ReaPER outperforms PER across various environment types, including the Atari-5 benchmark.