🤖 AI Summary
In online reinforcement learning, existing exploration methods based on pseudo-counts or prediction errors suffer from poor scalability to high-dimensional state spaces and weak theoretical guarantees. This paper proposes Random Distribution Distillation (RDD), the first approach to explicitly model target network outputs as samples from learnable Gaussian distributions and construct bounded intrinsic rewards via distribution matching. RDD unifies the pseudo-count and prediction-error paradigms within a single probabilistic framework. Its intrinsic reward admits a theoretically grounded decomposition into a decaying pseudo-count term and a converging discrepancy term, enabling both asymptotic exploration annealing and robust high-dimensional applicability. Empirically, RDD significantly outperforms baselines, including ICM, RND, and NGU, on Atari and DeepMind Control Suite benchmarks. Theoretically, the paper proves convergence of the intrinsic reward and establishes a lower bound on exploration efficiency, demonstrating synergistic improvements in sample efficiency and exploration performance.
📝 Abstract
Exploration remains a critical challenge in online reinforcement learning, as an agent must effectively explore unknown environments to achieve high returns. Current exploration algorithms fall mainly into count-based methods and curiosity-based methods, of which prediction-error methods are a prominent example. In this paper, we propose a novel method called **R**andom **D**istribution **D**istillation (RDD), which samples the output of a target network from a normal distribution. RDD facilitates more extensive exploration by explicitly treating the difference between the prediction network and the target network as an intrinsic reward. Furthermore, by introducing randomness into the target network's output for a given state and modeling it as a sample from a normal distribution, the intrinsic reward is bounded by two key components: a pseudo-count term that ensures proper exploration decay and a discrepancy term that accounts for predictor convergence. We demonstrate that RDD effectively unifies count-based and prediction-error approaches: it retains the advantages of prediction-error methods in high-dimensional spaces while also exhibiting an intrinsic-reward decay mode akin to that of pseudo-count methods. In the experimental section, RDD is compared with state-of-the-art methods across a series of environments. Both theoretical analysis and experimental results confirm the effectiveness of our approach in improving online exploration for reinforcement learning tasks.
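The mechanism described above can be sketched with a toy example. This is a minimal illustration under our own assumptions (linear networks, a fixed noise scale, plain SGD; all names and hyperparameters are ours, not the paper's): the prediction error against a Gaussian-sampled target output serves as the intrinsic reward and decays as the same state is visited repeatedly.

```python
import numpy as np

# Hypothetical RDD-style intrinsic reward sketch: a frozen random "target"
# network defines, for each state, the mean of a normal distribution; a
# trainable "predictor" regresses onto samples drawn from that distribution.
# The squared prediction error is the intrinsic reward.

rng = np.random.default_rng(0)
D_STATE, D_FEAT = 4, 8
W_target = rng.normal(size=(D_STATE, D_FEAT))  # frozen target weights
W_pred = np.zeros((D_STATE, D_FEAT))           # learnable predictor weights
SIGMA = 0.1                                    # assumed target-output noise scale
LR = 0.5

def intrinsic_reward_and_update(state):
    """Draw the target output from N(state @ W_target, SIGMA^2 I), return the
    squared prediction error, and take one SGD step on the predictor."""
    global W_pred
    target_sample = state @ W_target + SIGMA * rng.normal(size=D_FEAT)
    err = state @ W_pred - target_sample
    reward = float(np.mean(err ** 2))
    W_pred -= LR * (2.0 / D_FEAT) * np.outer(state, err)  # grad of mean sq. error
    return reward

state = rng.normal(size=D_STATE)
state /= np.linalg.norm(state)                 # one toy state, visited repeatedly
rewards = [intrinsic_reward_and_update(state) for _ in range(200)]
print(f"first visit: {rewards[0]:.4f}  200th visit: {rewards[-1]:.4f}")
```

Because the target output is resampled on every visit, the reward does not vanish entirely but settles near a noise floor governed by `SIGMA`, loosely mirroring the decomposition into a decaying pseudo-count term and a residual discrepancy term.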