🤖 AI Summary
Balancing exploration and exploitation in reinforcement learning traditionally relies on heuristic design. This paper proposes MaxInfoRL, a framework that uses information gain about the underlying task as an intrinsic exploration signal. When combined with Boltzmann exploration, the approach naturally trades off maximizing the value function against maximizing the entropy over states, actions, and rewards, enabling an adaptive balance between intrinsic and extrinsic rewards. Theoretically, MaxInfoRL achieves a sublinear regret bound in the simplified multi-armed bandit setting. Practically, it is plug-and-play compatible with mainstream off-policy model-free algorithms, including SAC and TD3, and supports visual inputs. Empirical evaluation on hard-exploration benchmarks (e.g., MiniGrid and DeepMind Control Suite) demonstrates substantial improvements in both sample efficiency and final performance. These results support the generality and practical efficacy of information-theoretic exploration.
📝 Abstract
Reinforcement learning (RL) algorithms aim to balance exploiting the current best strategy with exploring new options that could lead to higher rewards. Most common RL algorithms use undirected exploration, i.e., they select random sequences of actions. Exploration can also be directed using intrinsic rewards, such as curiosity or model epistemic uncertainty. However, effectively balancing task and intrinsic rewards is challenging and often task-dependent. In this work, we introduce MaxInfoRL, a framework for balancing intrinsic and extrinsic exploration. MaxInfoRL steers exploration towards informative transitions by maximizing intrinsic rewards such as the information gain about the underlying task. When combined with Boltzmann exploration, this approach naturally trades off maximization of the value function with that of the entropy over states, rewards, and actions. We show that our approach achieves sublinear regret in the simplified setting of multi-armed bandits. We then apply this general formulation to a variety of off-policy model-free RL methods for continuous state-action spaces, yielding novel algorithms that achieve superior performance across hard exploration problems and complex scenarios such as visual control tasks.
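The core idea of combining an extrinsic task reward with an information-gain-style intrinsic bonus can be illustrated with a minimal sketch. This is not the paper's implementation: the `ensemble_disagreement` function, the linear ensemble models, and the fixed temperature `alpha` are all illustrative assumptions (in practice the information gain would come from a learned model and the trade-off coefficient would be tuned automatically).

```python
import numpy as np

rng = np.random.default_rng(0)

def ensemble_disagreement(models, s, a):
    # Hypothetical epistemic-uncertainty proxy for information gain:
    # variance of an ensemble's next-state predictions across members.
    preds = np.stack([m(s, a) for m in models])   # (n_models, batch, state_dim)
    return preds.var(axis=0).sum(axis=-1)         # (batch,)

# Toy ensemble of linear dynamics models with independently sampled weights.
models = [
    (lambda W: (lambda s, a: np.concatenate([s, a], axis=-1) @ W))(
        rng.normal(size=(3, 2))
    )
    for _ in range(5)
]

s = rng.normal(size=(4, 2))    # batch of states (state_dim=2)
a = rng.normal(size=(4, 1))    # batch of actions (action_dim=1)
r_task = rng.normal(size=4)    # extrinsic (task) rewards

alpha = 0.2                    # exploration temperature (auto-tuned in practice)
r_intrinsic = ensemble_disagreement(models, s, a)

# Augmented reward: extrinsic reward plus a weighted information-gain bonus,
# which an off-policy learner (e.g., SAC- or TD3-style) would then maximize.
r_aug = r_task + alpha * r_intrinsic
```

Transitions where the ensemble members disagree receive a larger bonus, so the learner is steered towards informative transitions, matching the intrinsic/extrinsic trade-off described in the abstract.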