On the Convergence of Monte Carlo UCB for Random-Length Episodic MDPs

📅 2022-09-07
🏛️ arXiv.org
📈 Citations: 3
Influential: 0
📄 PDF
🤖 AI Summary
The convergence of the Q-function in Monte Carlo Upper Confidence Bound (MC-UCB) remains an open problem for stochastic episode lengths, which are common in games like Go and poker and in robotic tasks; existing theoretical analyses are restricted to fixed-horizon settings and fail to capture stationary policy behavior. Method: We develop a rigorous analysis integrating Monte Carlo return averaging, UCB-based exploration, martingale convergence theory, and stochastic process techniques, validated through numerical experiments. Results: We establish the first almost-sure convergence guarantee of MC-UCB's Q-estimates to the optimal Q-function for a broad class of random-length episodic MDPs, covering stochastic environments such as blackjack and deterministic ones such as Go. While we conjecture that MC-UCB, unlike Q-learning, does not converge for all MDPs, this result subsumes conventional finite-horizon MDPs as a special case. Our work fills a fundamental theoretical gap for MC-UCB beyond fixed-length episodes and provides precise sufficient conditions for convergence.
📝 Abstract
In reinforcement learning, Monte Carlo algorithms update the Q function by averaging the episodic returns. In the Monte Carlo UCB (MC-UCB) algorithm, the action taken in each state is the action that maximizes the Q function plus a UCB exploration term, which biases the choice of actions to those that have been chosen less frequently. Although there has been significant work on establishing regret bounds for MC-UCB, most of that work has been focused on finite-horizon versions of the problem, for which each episode terminates after a constant number of steps. For such finite-horizon problems, the optimal policy depends both on the current state and the time within the episode. However, for many natural episodic problems, such as games like Go and Chess and robotic tasks, the episode is of random length and the optimal policy is stationary. For such environments, it is an open question whether the Q-function in MC-UCB will converge to the optimal Q function; we conjecture that, unlike Q-learning, it does not converge for all MDPs. We nevertheless show that for a large class of MDPs, which includes stochastic MDPs such as blackjack and deterministic MDPs such as Go, the Q-function in MC-UCB converges almost surely to the optimal Q function. An immediate corollary of this result is that it also converges almost surely for all finite-horizon MDPs. We also provide numerical experiments, providing further insights into MC-UCB.
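The abstract's description of MC-UCB (Q-estimates as averages of episodic returns, actions chosen to maximize Q plus a UCB bonus) can be sketched as follows. This is a minimal illustrative implementation, not the paper's: the first-visit averaging, the bonus form, and the toy two-state random-length environment at the bottom are all assumptions made for the sketch.

```python
import math
import random
from collections import defaultdict

def mc_ucb(env_step, env_reset, actions, episodes=500, c=1.0, gamma=1.0, seed=0):
    """First-visit Monte Carlo UCB sketch for tabular episodic MDPs.

    Q(s, a) is the running average of first-visit episodic returns;
    each action maximizes Q plus a UCB exploration bonus, which biases
    selection toward less-frequently chosen actions.
    """
    rng = random.Random(seed)
    Q = defaultdict(float)   # running average of returns per (state, action)
    n_sa = defaultdict(int)  # visit counts per (state, action)
    n_s = defaultdict(int)   # visit counts per state

    for _ in range(episodes):
        s = env_reset(rng)
        trajectory, done = [], False
        while not done:
            # UCB action selection: untried actions get an infinite bonus.
            def ucb(a):
                if n_sa[(s, a)] == 0:
                    return float("inf")
                return Q[(s, a)] + c * math.sqrt(math.log(n_s[s]) / n_sa[(s, a)])
            a = max(actions, key=ucb)
            s_next, r, done = env_step(s, a, rng)
            trajectory.append((s, a, r))
            s = s_next

        # Compute returns backwards, then average first-visit returns into Q.
        G, returns = 0.0, []
        for s_t, a_t, r_t in reversed(trajectory):
            G = r_t + gamma * G
            returns.append((s_t, a_t, G))
        seen = set()
        for s_t, a_t, G_t in reversed(returns):  # forward episode order
            if (s_t, a_t) in seen:
                continue
            seen.add((s_t, a_t))
            n_s[s_t] += 1
            n_sa[(s_t, a_t)] += 1
            Q[(s_t, a_t)] += (G_t - Q[(s_t, a_t)]) / n_sa[(s_t, a_t)]
    return Q

# Hypothetical toy MDP with random episode length (stationary optimal policy):
# s0: a0 -> s1, reward 0;  a1 -> terminate, reward 0.2.
# s1: a0 -> terminate, reward 1;  a1 -> reward 0, and with probability
#     0.5 the episode continues in s1, so episode length is random.
def reset(rng):
    return "s0"

def step(s, a, rng):
    if s == "s0":
        return ("s1", 0.0, False) if a == "a0" else ("s0", 0.2, True)
    if a == "a0":
        return "s1", 1.0, True
    return ("s1", 0.0, False) if rng.random() < 0.5 else ("s1", 0.0, True)
```

On the toy environment, `mc_ucb(step, reset, ["a0", "a1"])` drives the Q-estimates for the terminating actions toward their true values, illustrating the convergence behavior the paper analyzes.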
Problem

Research questions and friction points this paper is trying to address.

Convergence of MC-UCB in random-length episodic MDPs
Stationary optimal policies in stochastic and deterministic random-length MDPs
Comparison of MC-UCB convergence with Q-learning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Monte Carlo UCB algorithm for episodic MDPs
Almost-sure convergence proof covering stochastic (e.g., blackjack) and deterministic (e.g., Go) MDPs
Numerical experiments validating MC-UCB performance