🤖 AI Summary
This paper addresses regret minimization in episodic Markov decision processes (MDPs) for model-free reinforcement learning. The authors propose RandQL, the first tractable model-free posterior sampling algorithm. Its key innovations are: (1) a learning-rate randomization mechanism that enables optimistic exploration without explicit exploration bonuses; and (2) the first rigorous realization of posterior sampling in a model-free setting, requiring neither an environment model nor reward shaping. Theoretically, RandQL achieves a regret bound of $\widetilde{\mathcal{O}}(\sqrt{H^{5}SAT})$ for tabular MDPs and $\widetilde{\mathcal{O}}(H^{5/2}T^{(d_z+1)/(d_z+2)})$ for MDPs with a metric state-action space, where $d_z$ denotes the zooming dimension. Empirical evaluation shows that RandQL outperforms existing exploration algorithms on benchmark environments.
📝 Abstract
In this paper, we introduce Randomized Q-learning (RandQL), a novel randomized model-free algorithm for regret minimization in episodic Markov Decision Processes (MDPs). To the best of our knowledge, RandQL is the first tractable model-free posterior sampling-based algorithm. We analyze the performance of RandQL in both tabular and non-tabular metric space settings. In tabular MDPs, RandQL achieves a regret bound of order $\widetilde{\mathcal{O}}(\sqrt{H^{5}SAT})$, where $H$ is the planning horizon, $S$ is the number of states, $A$ is the number of actions, and $T$ is the number of episodes. For a metric state-action space, RandQL enjoys a regret bound of order $\widetilde{\mathcal{O}}(H^{5/2} T^{(d_z+1)/(d_z+2)})$, where $d_z$ denotes the zooming dimension. Notably, RandQL achieves optimistic exploration without using bonuses, relying instead on a novel idea of learning rate randomization. Our empirical study shows that RandQL outperforms existing approaches on baseline exploration environments.
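To make the learning-rate-randomization idea concrete, here is a minimal sketch of a tabular Q-learning update in that spirit. It is not the paper's exact RandQL procedure; the ensemble size, the Beta step-size distribution, and the table shapes are illustrative assumptions. The point it illustrates is that each ensemble member draws its own random step size toward an optimistically initialized target, so the maximum over the ensemble stays optimistic without any explicit bonus term.

```python
import numpy as np

rng = np.random.default_rng(0)

H, S, A, J = 3, 4, 2, 10          # horizon, states, actions, ensemble size (illustrative)
# J Q-tables, optimistically initialized at the value upper bound H;
# the value after the final step h = H is zero.
Q_ensemble = np.full((J, H + 1, S, A), float(H))
Q_ensemble[:, H] = 0.0
counts = np.zeros((H, S, A), dtype=int)

def randql_update(h, s, a, r, s_next):
    """One randomized-learning-rate Q-update (a sketch, not the paper's
    exact algorithm).  Each ensemble member samples its own step size
    from a Beta distribution instead of using a deterministic rate
    plus an exploration bonus."""
    counts[h, s, a] += 1
    n = counts[h, s, a]
    for j in range(J):
        # Beta(H + 1, n) has mean (H + 1) / (H + 1 + n), mimicking the
        # usual (H + 1)/(H + n) step size but with randomization.
        w = rng.beta(H + 1, n)
        target = r + Q_ensemble[j, h + 1, s_next].max()
        Q_ensemble[j, h, s, a] = (1 - w) * Q_ensemble[j, h, s, a] + w * target
    # Act greedily with respect to the max over the ensemble,
    # which serves as the optimistic value estimate.
    return Q_ensemble[:, h, s, a].max()
```

For example, `randql_update(0, 0, 1, 1.0, 2)` performs one update at step `h = 0` for state 0, action 1, with reward 1.0 and next state 2, returning the optimistic Q-estimate for that pair.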