🤖 AI Summary
This paper studies model-free estimation of the optimal Q-function for discounted Markov decision processes (MDPs) in the synchronous setting, where a generative model permits sampling from every state-action pair. The authors propose VRCQ, a novel algorithm built on the Cascade Q-learning variance-reduction paradigm, which integrates direct variance control with a cascaded iterative update mechanism. Theoretically, VRCQ achieves the minimax-optimal sample complexity in the ℓ∞-norm for optimal Q-function estimation. In the special case of policy evaluation, it attains non-asymptotic instance optimality, matching the fundamental information-theoretic lower bound. Numerical experiments demonstrate that VRCQ outperforms existing model-free methods in both convergence rate and stability.
📝 Abstract
We study the problem of estimating the optimal Q-function of $\gamma$-discounted Markov decision processes (MDPs) under the synchronous setting, where independent samples for all state-action pairs are drawn from a generative model at each iteration. We introduce and analyze a novel model-free algorithm called Variance-Reduced Cascade Q-learning (VRCQ). VRCQ comprises two key building blocks: (i) the established direct variance reduction technique and (ii) our proposed variance reduction scheme, Cascade Q-learning. By leveraging these techniques, VRCQ provides superior guarantees in the $\ell_\infty$-norm compared with existing model-free stochastic-approximation-type algorithms. Specifically, we demonstrate that VRCQ is minimax optimal. Additionally, when the action set is a singleton (so that the Q-learning problem reduces to policy evaluation), it achieves non-asymptotic instance optimality while requiring the minimum number of samples theoretically possible. Our theoretical results and their practical implications are supported by numerical experiments.
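To make the synchronous setting concrete, the sketch below implements plain synchronous Q-learning with a generative model: at every iteration, one fresh next-state sample is drawn for *each* state-action pair, and all Q-values are updated simultaneously. This is the baseline stochastic-approximation scheme the abstract compares against, not the VRCQ algorithm itself (whose cascaded variance-reduction updates are not specified here); the rescaled-linear step size is one common choice from the synchronous Q-learning literature.

```python
import numpy as np

def synchronous_q_learning(P, R, gamma, num_iters, seed=0):
    """Baseline synchronous Q-learning with a generative model.

    P: (S, A, S) transition tensor, P[s, a] a distribution over next states.
    R: (S, A) deterministic rewards.
    Each iteration draws one next-state sample for EVERY (s, a) pair
    (the synchronous setting) and updates all Q-values at once.
    """
    rng = np.random.default_rng(seed)
    S, A = R.shape
    Q = np.zeros((S, A))
    for t in range(1, num_iters + 1):
        # Rescaled-linear step size (a common choice for this setting).
        eta = 1.0 / (1.0 + (1.0 - gamma) * t)
        # One generative-model sample per state-action pair.
        next_states = np.array(
            [[rng.choice(S, p=P[s, a]) for a in range(A)] for s in range(S)]
        )
        # Empirical Bellman optimality target, then a convex-combination update.
        target = R + gamma * Q[next_states].max(axis=2)
        Q = (1.0 - eta) * Q + eta * target
    return Q
```

Variance-reduced methods such as the one in this paper modify the noisy `target` term (for example, by re-centering it around a periodically refreshed reference estimate) so that the per-iteration noise shrinks as the iterates approach the optimal Q-function.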