🤖 AI Summary
This paper addresses the challenge of efficient intra-episode exploration of unseen states in reinforcement learning. It proposes an active exploration framework that couples nonparametric, memory-based density estimation with meta-learning to build a dynamic memory model of what the agent has seen. Intrinsic rewards are derived from the low probability density of novel observations under this model, and a recurrent policy network uses them to optimize its exploration behavior online within a single episode. The contributions are threefold: (1) unsupervised, online adaptation of the exploration policy via density-based feedback; (2) no reliance on environmental priors, state-visitation counts, or prediction-error signals; and (3) zero-shot transfer to novel sparse-reward tasks. Experiments demonstrate significant improvements in both exploration efficiency and generalization over conventional exploration methods.
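The summary does not pin down the density model, so here is a minimal sketch of one common nonparametric choice: a Gaussian kernel density estimate over an episodic memory buffer, with the novelty bonus defined as the negative log-density. The function names, the `bandwidth` parameter, and the handling of an empty memory are illustrative assumptions, not details from the paper.

```python
import numpy as np
from scipy.special import logsumexp

def kde_log_density(obs, memory, bandwidth=0.5):
    """Log density of `obs` under a Gaussian kernel density estimate
    built from the episodic memory of past observations. Nonparametric:
    the model changes simply because the memory grows."""
    mem = np.asarray(memory)                      # shape (N, d)
    sq_dists = np.sum((mem - obs) ** 2, axis=1)   # squared L2 distance to each memory
    d = obs.shape[0]
    log_norm = -0.5 * d * np.log(2.0 * np.pi * bandwidth ** 2)
    # Mean of N Gaussian kernels, evaluated in log space for numerical stability.
    return log_norm + logsumexp(-sq_dists / (2.0 * bandwidth ** 2)) - np.log(len(mem))

def intrinsic_reward(obs, memory, bandwidth=0.5):
    """Novelty bonus: observations with low density under the memory
    (i.e. unfamiliar ones) receive a large reward."""
    if len(memory) == 0:
        return 0.0  # assumption: neutral reward before anything is remembered
    return -kde_log_density(obs, memory, bandwidth)
```

Because the estimate is memory-based rather than fit by gradient descent, adding an observation to the buffer updates the density model immediately, which is what allows the reward signal to shift within a single episode.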
📝 Abstract
Exploration algorithms for reinforcement learning typically replace or augment the reward function with an additional "intrinsic" reward that trains the agent to seek previously unseen states of the environment. Here, we consider an exploration algorithm that exploits meta-learning, or learning to learn, such that the agent learns to maximize its exploration progress within a single episode, even between training epochs. The agent learns a policy that aims to minimize the probability density of new observations with respect to all of its memories. In addition, it receives evaluations of the current observation's density as feedback and retains that feedback in a recurrent network. By remembering trajectories of density, the agent learns to navigate a complex and growing landscape of familiarity in real time, allowing it to maximize its exploration progress even in completely novel states of the environment for which its policy has not been trained.
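To make the recurrent mechanism concrete, below is a hedged sketch of one intra-episode rollout: the policy receives the current observation together with the latest density evaluation, and the memory grows online, so the familiarity landscape shifts under the agent as it explores. The GRU architecture, hidden size, the Gymnasium-style environment interface, and the `kde_log_density` helper (sketched under the summary above) are assumptions for illustration, not the paper's exact design.

```python
import torch
import torch.nn as nn

class RecurrentExplorer(nn.Module):
    """Policy conditioned on (observation, latest density feedback); the
    recurrent state lets it remember trajectories of density over the episode."""
    def __init__(self, obs_dim, n_actions, hidden=128):
        super().__init__()
        self.cell = nn.GRUCell(obs_dim + 1, hidden)  # +1 input for density feedback
        self.pi = nn.Linear(hidden, n_actions)

    def step(self, obs, log_density, h):
        x = torch.cat([obs, log_density], dim=-1)    # append feedback to observation
        h = self.cell(x, h)
        action = torch.distributions.Categorical(logits=self.pi(h)).sample()
        return action, h

def explore_episode(env, policy, max_steps=500):
    """One rollout in which the agent both acts on and updates its memory."""
    memory, h = [], torch.zeros(1, policy.cell.hidden_size)
    obs, _ = env.reset()
    for _ in range(max_steps):
        logp = kde_log_density(obs, memory) if memory else 0.0  # density feedback
        action, h = policy.step(
            torch.as_tensor(obs, dtype=torch.float32).unsqueeze(0),
            torch.tensor([[float(logp)]]), h)
        memory.append(obs)                           # memory grows online
        obs, _, terminated, truncated, _ = env.step(action.item())
        # Intrinsic reward: novelty of the new observation w.r.t. all memories;
        # the meta-learning update that consumes it is omitted from this sketch.
        r_int = -kde_log_density(obs, memory)
        if terminated or truncated:
            break
```

Feeding the density evaluation back into the recurrent state, rather than only into the reward, is what lets the policy react to the growing memory at test time, even in environments it was never trained on.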