Optimal Posterior Sampling for Policy Identification in Tabular Markov Decision Processes

📅 2026-05-05

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the problem of identifying an $(\varepsilon, \delta)$-PAC optimal policy in finite-horizon tabular Markov decision processes. Existing methods are computationally expensive and their sample complexity has a suboptimal dependence on $\log(1/\delta)$. To overcome these limitations, the authors propose a randomized algorithm that combines posterior sampling with exploration guided by an online learning algorithm. The approach runs in $O(S^2 A H)$ time per episode, matching standard model-based methods. It attains asymptotic optimality in both sample complexity and posterior contraction rate, avoiding the suboptimal polynomial dependence on $\log(1/\delta)$ of prior algorithms such as MOCA and PEDEL. The result is an algorithm for policy identification in tabular MDPs that is both theoretically grounded and practical to implement.
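
For reference, the $(\varepsilon, \delta)$-PAC criterion is commonly formalized as below. This is the standard textbook statement rather than a formula quoted from the paper, and the notation is assumed: $\tau$ is the (random) number of episodes before the algorithm stops, $\hat{\pi}$ is the policy it returns, and $V_1^{\pi}(s_1)$ is the value of policy $\pi$ from a fixed initial state $s_1$.

$$\mathbb{P}\left(\tau < \infty,\; V_1^{\star}(s_1) - V_1^{\hat{\pi}}(s_1) \le \varepsilon\right) \ge 1 - \delta$$

Under this reading, sample complexity refers to $\tau$, and the dependence on $\log(1/\delta)$ describes how $\mathbb{E}[\tau]$ grows as $\delta \to 0$.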

📝 Abstract

We study the $(\varepsilon, \delta)$-PAC policy identification problem in finite-horizon episodic Markov Decision Processes. Existing approaches provide finite-time guarantees for the approximate setting ($\varepsilon>0$) but are computationally expensive, which makes them hard to implement, and their sample complexity has a suboptimal dependence on $\log(1/\delta)$. We propose a randomized, computationally efficient algorithm for best policy identification that combines posterior sampling with an online learning algorithm to guide exploration in the MDP. Our method is asymptotically optimal in both sample complexity and posterior contraction rate, and runs in $O(S^2AH)$ time per episode, matching standard model-based approaches. Unlike prior algorithms such as MOCA and PEDEL, our guarantees remain meaningful in the asymptotic regime and avoid a suboptimal polynomial dependence on $\log(1/\delta)$. Our results provide both theoretical insights and practical tools for efficient policy identification in tabular MDPs.
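
To make the $O(S^2AH)$ per-episode cost concrete, here is a minimal sketch of the two generic building blocks the abstract names: drawing a transition model from a posterior, and solving the sampled model by backward induction. It is not the paper's algorithm. The Dirichlet posterior, the known reward table, and the function names `sample_dirichlet`, `sample_transition_model`, and `backward_induction` are all assumptions made for illustration, and the online-learning exploration component is omitted entirely.

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw one probability vector from Dirichlet(alpha) via normalized Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def sample_transition_model(counts, prior, rng):
    """Sample P[h][s][a] ~ Dirichlet(prior + counts[h][s][a]) for every (h, s, a).

    Assumption: independent symmetric Dirichlet priors, a common choice for
    tabular posterior sampling; the paper's prior may differ.
    """
    return [[[sample_dirichlet([prior + c for c in counts_hsa], rng)
              for counts_hsa in counts_hs]
             for counts_hs in counts_h]
            for counts_h in counts]


def backward_induction(P, R):
    """Optimal values and a greedy policy for a finite-horizon tabular MDP.

    P[h][s][a] is a list of next-state probabilities and R[h][s][a] a reward.
    Loops over h, s, a plus the inner sum over next states cost O(S^2 A H).
    """
    H, S, A = len(P), len(P[0]), len(P[0][0])
    V = [[0.0] * S for _ in range(H + 1)]  # V[H] is the zero terminal value
    policy = [[0] * S for _ in range(H)]
    for h in reversed(range(H)):
        for s in range(S):
            best_q, best_a = float("-inf"), 0
            for a in range(A):
                q = R[h][s][a] + sum(p * v for p, v in zip(P[h][s][a], V[h + 1]))
                if q > best_q:  # strict comparison: ties keep the lowest action index
                    best_q, best_a = q, a
            V[h][s] = best_q
            policy[h][s] = best_a
    return V, policy
```

In a posterior-sampling loop, one would typically call `sample_transition_model` on the current visit counts once per episode and then run `backward_induction` on the sampled model, so the planning step dominates the per-episode cost.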

Problem

Research questions and friction points this paper is trying to address.

policy identification

Markov Decision Processes

PAC learning

sample complexity

posterior sampling

Innovation

Methods, ideas, or system contributions that make the work stand out.

posterior sampling

policy identification

sample complexity