🤖 AI Summary
Balancing exploration and exploitation in reinforcement learning traditionally relies on heuristic design. This paper proposes MaxInfoRL, a framework that uses information gain about the underlying task as an intrinsic exploration signal. When combined with Boltzmann exploration, the approach naturally trades off maximizing the value function against maximizing the entropy over states, actions, and rewards, enabling an adaptive balance between intrinsic and extrinsic rewards. Theoretically, MaxInfoRL achieves a sublinear regret bound in the simplified multi-armed bandit setting. Practically, it is plug-and-play compatible with mainstream off-policy model-free algorithms, including SAC and TD3, and supports visual inputs. Empirical evaluation on hard-exploration benchmarks (e.g., MiniGrid and DeepMind Control Suite) demonstrates substantial improvements in both sample efficiency and final performance. These results support the generality and practical efficacy of information-theoretic exploration.
📝 Abstract
Reinforcement learning (RL) algorithms aim to balance exploiting the current best strategy with exploring new options that could lead to higher rewards. Most common RL algorithms use undirected exploration, i.e., they select random sequences of actions. Exploration can also be directed using intrinsic rewards, such as curiosity or model epistemic uncertainty. However, effectively balancing task and intrinsic rewards is challenging and often task-dependent. In this work, we introduce MaxInfoRL, a framework for balancing intrinsic and extrinsic exploration. MaxInfoRL steers exploration towards informative transitions by maximizing intrinsic rewards such as the information gain about the underlying task. When combined with Boltzmann exploration, this approach naturally trades off maximization of the value function with that of the entropy over states, rewards, and actions. We show that our approach achieves sublinear regret in the simplified setting of multi-armed bandits. We then apply this general formulation to a variety of off-policy model-free RL methods for continuous state-action spaces, yielding novel algorithms that achieve superior performance across hard exploration problems and complex scenarios such as visual control tasks.
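The core idea of combining an extrinsic task reward with an information-gain-style intrinsic bonus can be illustrated with a minimal sketch. This is not the paper's implementation: the `ensemble_disagreement` function, the linear ensemble models, and the fixed temperature `alpha` are all illustrative assumptions (in practice the information gain would come from a learned model and the trade-off coefficient would be tuned automatically).

```python
import numpy as np

rng = np.random.default_rng(0)

def ensemble_disagreement(models, s, a):
    # Hypothetical epistemic-uncertainty proxy for information gain:
    # variance of an ensemble's next-state predictions across members.
    preds = np.stack([m(s, a) for m in models])   # (n_models, batch, state_dim)
    return preds.var(axis=0).sum(axis=-1)         # (batch,)

# Toy ensemble of linear dynamics models with independently sampled weights.
models = [
    (lambda W: (lambda s, a: np.concatenate([s, a], axis=-1) @ W))(
        rng.normal(size=(3, 2))
    )
    for _ in range(5)
]

s = rng.normal(size=(4, 2))    # batch of states (state_dim=2)
a = rng.normal(size=(4, 1))    # batch of actions (action_dim=1)
r_task = rng.normal(size=4)    # extrinsic (task) rewards

alpha = 0.2                    # exploration temperature (auto-tuned in practice)
r_intrinsic = ensemble_disagreement(models, s, a)

# Augmented reward: extrinsic reward plus a weighted information-gain bonus,
# which an off-policy learner (e.g., SAC- or TD3-style) would then maximize.
r_aug = r_task + alpha * r_intrinsic
```

Transitions where the ensemble members disagree receive a larger bonus, so the learner is steered towards informative transitions, matching the intrinsic/extrinsic trade-off described in the abstract.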