🤖 AI Summary
This paper addresses regret minimization in episodic Markov decision processes (MDPs) for model-free reinforcement learning. The authors propose RandQL, the first tractable model-free posterior sampling algorithm. Its key innovations are: (1) a learning-rate randomization mechanism that enables optimistic exploration without explicit exploration bonuses; and (2) the first rigorous realization of posterior sampling in a model-free setting, requiring neither an environment model nor reward shaping. Theoretically, RandQL achieves a regret bound of $\widetilde{\mathcal{O}}(\sqrt{H^{5}SAT})$ for tabular MDPs and $\widetilde{\mathcal{O}}(H^{5/2}T^{(d_z+1)/(d_z+2)})$ for MDPs with a metric state-action space, where $d_z$ denotes the zooming dimension. Empirical evaluation shows that RandQL outperforms existing exploration algorithms on benchmark environments.
📝 Abstract
In this paper, we introduce Randomized Q-learning (RandQL), a novel randomized model-free algorithm for regret minimization in episodic Markov Decision Processes (MDPs). To the best of our knowledge, RandQL is the first tractable model-free posterior sampling-based algorithm. We analyze the performance of RandQL in both tabular and non-tabular metric space settings. In tabular MDPs, RandQL achieves a regret bound of order $\widetilde{\mathcal{O}}(\sqrt{H^{5}SAT})$, where $H$ is the planning horizon, $S$ is the number of states, $A$ is the number of actions, and $T$ is the number of episodes. For a metric state-action space, RandQL enjoys a regret bound of order $\widetilde{\mathcal{O}}(H^{5/2} T^{(d_z+1)/(d_z+2)})$, where $d_z$ denotes the zooming dimension. Notably, RandQL achieves optimistic exploration without using bonuses, relying instead on a novel idea of learning rate randomization. Our empirical study shows that RandQL outperforms existing approaches on baseline exploration environments.
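To make the learning-rate-randomization idea concrete, here is a minimal sketch of a tabular Q-learning update in that spirit. It is not the paper's exact RandQL procedure; the ensemble size, the Beta step-size distribution, and the table shapes are illustrative assumptions. The point it illustrates is that each ensemble member draws its own random step size toward an optimistically initialized target, so the maximum over the ensemble stays optimistic without any explicit bonus term.

```python
import numpy as np

rng = np.random.default_rng(0)

H, S, A, J = 3, 4, 2, 10          # horizon, states, actions, ensemble size (illustrative)
# J Q-tables, optimistically initialized at the value upper bound H;
# the value after the final step h = H is zero.
Q_ensemble = np.full((J, H + 1, S, A), float(H))
Q_ensemble[:, H] = 0.0
counts = np.zeros((H, S, A), dtype=int)

def randql_update(h, s, a, r, s_next):
    """One randomized-learning-rate Q-update (a sketch, not the paper's
    exact algorithm).  Each ensemble member samples its own step size
    from a Beta distribution instead of using a deterministic rate
    plus an exploration bonus."""
    counts[h, s, a] += 1
    n = counts[h, s, a]
    for j in range(J):
        # Beta(H + 1, n) has mean (H + 1) / (H + 1 + n), mimicking the
        # usual (H + 1)/(H + n) step size but with randomization.
        w = rng.beta(H + 1, n)
        target = r + Q_ensemble[j, h + 1, s_next].max()
        Q_ensemble[j, h, s, a] = (1 - w) * Q_ensemble[j, h, s, a] + w * target
    # Act greedily with respect to the max over the ensemble,
    # which serves as the optimistic value estimate.
    return Q_ensemble[:, h, s, a].max()
```

For example, `randql_update(0, 0, 1, 1.0, 2)` performs one update at step `h = 0` for state 0, action 1, with reward 1.0 and next state 2, returning the optimistic Q-estimate for that pair.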