Q-learning with Posterior Sampling

📅 2025-06-01

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

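Studies how Bayesian posterior sampling can improve the exploration efficiency of Q-learning, and proposes the PSQL algorithm, which achieves a regret bound close to the theoretical lower bound in tabular MDPs.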

📝 Abstract

Bayesian posterior sampling techniques have demonstrated superior empirical performance in many exploration-exploitation settings. However, their theoretical analysis remains a challenge, especially in complex settings like reinforcement learning. In this paper, we introduce Q-Learning with Posterior Sampling (PSQL), a simple Q-learning-based algorithm that uses Gaussian posteriors on Q-values for exploration, akin to the popular Thompson Sampling algorithm in the multi-armed bandit setting. We show that in the tabular episodic MDP setting, PSQL achieves a regret bound of $\tilde O(H^2\sqrt{SAT})$, closely matching the known lower bound of $\Omega(H\sqrt{SAT})$. Here, $S$ and $A$ denote the number of states and actions in the underlying Markov Decision Process (MDP), and $T=KH$ with $K$ being the number of episodes and $H$ being the planning horizon. Our work provides several new technical insights into the core challenges in combining posterior sampling with dynamic programming and TD-learning-based RL algorithms, along with novel ideas for resolving those difficulties. We hope this will form a starting point for analyzing this efficient and important algorithmic technique in even more complex RL settings.
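
To make the idea in the abstract concrete, here is a minimal sketch of Q-learning driven by Gaussian sampling over Q-values in a tabular episodic MDP. It is an illustration under stated assumptions, not the paper's PSQL algorithm: the function name `psql_sketch`, the noise scale `sigma`, the variance schedule `sigma**2 / n`, the optimistic initialization, the fixed start state 0, and the learning rate `(H + 1) / (H + n)` (borrowed from standard optimistic Q-learning analyses) are all choices made for this example.

```python
import numpy as np

def psql_sketch(P, R, H, K, sigma=1.0, seed=0):
    """Illustrative Q-learning with Gaussian sampling on Q-values.

    P: (S, A, S) transition probabilities; R: (S, A) rewards in [0, 1].
    H: planning horizon; K: number of episodes.
    Returns the per-step mean Q-estimates and the reward collected per episode.
    """
    rng = np.random.default_rng(seed)
    S, A = R.shape
    # Posterior means, initialized optimistically to the largest possible
    # remaining return (an assumption of this sketch).
    q_mean = np.zeros((H, S, A))
    for h in range(H):
        q_mean[h] = H - h
    counts = np.zeros((H, S, A), dtype=int)
    returns = np.zeros(K)

    def sample_q(h, s):
        # Draw one Gaussian sample per action; the posterior narrows as the
        # visit count grows. Past the horizon every value is zero.
        if h == H:
            return np.zeros(A)
        std = sigma / np.sqrt(np.maximum(counts[h, s], 1))
        return rng.normal(q_mean[h, s], std)

    for k in range(K):
        s = 0  # fixed start state (assumption)
        for h in range(H):
            # Thompson-style action choice: act greedily on the sampled values.
            a = int(np.argmax(sample_q(h, s)))
            r = R[s, a]
            s_next = rng.choice(S, p=P[s, a])
            counts[h, s, a] += 1
            n = counts[h, s, a]
            alpha = (H + 1) / (H + n)
            # TD target built from a fresh sample at the next step
            # (a simplification; the paper's target may differ).
            target = r + np.max(sample_q(h + 1, s_next))
            q_mean[h, s, a] = (1 - alpha) * q_mean[h, s, a] + alpha * target
            returns[k] += r
            s = s_next
    return q_mean, returns
```

Calling `psql_sketch(P, R, H=3, K=50)` on a small MDP returns an array of shape `(H, S, A)` with the mean estimates and a length-`K` array of episode returns. Sampling from a posterior whose width shrinks with the visit count is what replaces the explicit exploration bonus used by UCB-style Q-learning.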

Problem

Research questions and friction points this paper is trying to address.

Analyzing the theoretical performance of Bayesian posterior sampling in RL

Developing Q-learning with posterior sampling for efficient exploration

Achieving near-optimal regret bounds in tabular episodic MDPs

Innovation

Methods, ideas, or system contributions that make the work stand out.

Q-learning with Gaussian posterior sampling

Regret bound of $\tilde O(H^2\sqrt{SAT})$, within a factor of $H$ (up to logarithmic terms) of the known lower bound $\Omega(H\sqrt{SAT})$

Combining posterior sampling with dynamic programming

🔎 Similar Papers

Iterated $Q$-Network: Beyond One-Step Bellman Updates in Deep Reinforcement Learning