🤖 AI Summary
To address low exploration efficiency, poor policy interpretability, and high computational overhead in reinforcement learning, this paper proposes β-DQN. It models state-dependent action generation probabilities via a behavior function β to construct a diverse population of policies and introduces an adaptive meta-controller for online, policy-level selection. Methodologically, it innovatively integrates a lightweight, behavior-statistics-based exploration mechanism, state-action coverage guidance, and Q-value overestimation correction—jointly enhancing exploration quality and interpretability. Empirically, β-DQN significantly outperforms baselines—including ε-greedy, NoisyNet, and Bootstrapped DQN—across diverse exploration-intensive tasks. It achieves improved sample efficiency and final performance with minimal additional computational cost, demonstrating both the effectiveness and practicality of policy-level exploration modeling.
📝 Abstract
While many sophisticated exploration methods have been proposed, their lack of generality and high computational cost often lead researchers to favor simpler methods like $\epsilon$-greedy. Motivated by this, we introduce $\beta$-DQN, a simple and efficient exploration method that augments the standard DQN with a behavior function $\beta$. This function estimates the probability that each action has been taken at each state. By leveraging $\beta$, we generate a population of diverse policies that balance exploration between state-action coverage and overestimation bias correction. An adaptive meta-controller is designed to select an effective policy for each episode, enabling flexible and explainable exploration. $\beta$-DQN is straightforward to implement and adds minimal computational overhead to the standard DQN. Experiments on both simple and challenging exploration domains show that $\beta$-DQN outperforms existing baseline methods across a wide range of tasks, providing an effective solution for improving exploration in deep reinforcement learning.
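To make the core idea concrete, here is a minimal tabular sketch of how a behavior function $\beta$ can induce a population of policies. It assumes $\beta(s, a)$ estimates how often action $a$ has been taken in state $s$, and uses an additive trade-off `Q - w * beta` between value and novelty; the tables, the weighting scheme, and the weight grid are illustrative assumptions, not the paper's exact construction (in the paper both $Q$ and $\beta$ are neural networks).

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions = 5, 3

# Hypothetical tables standing in for the learned networks:
# Q[s, a]    -- value estimates from the DQN
# beta[s, a] -- estimated probability that action a was taken in state s
Q = rng.normal(size=(n_states, n_actions))
beta = rng.dirichlet(np.ones(n_actions), size=n_states)

def policy(state: int, w: float) -> int:
    """One member of the policy population.

    w = 0 recovers pure greedy exploitation of Q; larger w steers the
    agent toward rarely tried actions (low beta), i.e. better
    state-action coverage. The additive form is an illustrative choice.
    """
    return int(np.argmax(Q[state] - w * beta[state]))

# A population of policies indexed by the exploration weight w.
# A meta-controller, as in the abstract, would pick one member per
# episode based on its observed effectiveness.
weights = [0.0, 0.5, 1.0, 2.0]
actions_in_state_0 = [policy(0, w) for w in weights]
```

With `w = 0` the policy coincides with the greedy DQN policy, so the population always contains pure exploitation as one member; episode-level selection among the members is what makes the exploration explainable.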