Variance-Reduced Cascade Q-learning: Algorithms and Sample Complexity

📅 2024-08-13
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This paper studies the model-free estimation of the optimal Q-function for discounted Markov decision processes (MDPs) under the synchronous setting. For environments where a generative model permits sampling from every state-action pair, we propose VRCQ—a novel algorithm that introduces, for the first time, the Cascade Q-learning variance-reduction paradigm, integrating direct variance control with a cascaded iterative update mechanism. Theoretically, VRCQ achieves the minimax-optimal sample complexity in the ℓ∞-norm for optimal Q-function estimation. In the special case of policy evaluation, it attains non-asymptotic instance optimality—matching the fundamental information-theoretic lower bound. Numerical experiments demonstrate that VRCQ significantly outperforms existing model-free methods in both convergence rate and stability.

Technology Category

Application Category

📝 Abstract
We study the problem of estimating the optimal Q-function of $gamma$-discounted Markov decision processes (MDPs) under the synchronous setting, where independent samples for all state-action pairs are drawn from a generative model at each iteration. We introduce and analyze a novel model-free algorithm called Variance-Reduced Cascade Q-learning (VRCQ). VRCQ comprises two key building blocks: (i) the established direct variance reduction technique and (ii) our proposed variance reduction scheme, Cascade Q-learning. By leveraging these techniques, VRCQ provides superior guarantees in the $ell_infty$-norm compared with the existing model-free stochastic approximation-type algorithms. Specifically, we demonstrate that VRCQ is minimax optimal. Additionally, when the action set is a singleton (so that the Q-learning problem reduces to policy evaluation), it achieves non-asymptotic instance optimality while requiring the minimum number of samples theoretically possible. Our theoretical results and their practical implications are supported by numerical experiments.
Problem

Research questions and friction points this paper is trying to address.

Estimating optimal Q-function in discounted MDPs
Improving sample complexity via variance reduction
Achieving minimax optimality in Q-learning algorithms
Innovation

Methods, ideas, or system contributions that make the work stand out.

Variance-Reduced Cascade Q-learning algorithm
Direct variance reduction technique
Cascade Q-learning variance reduction scheme
M
Mohammad Boveiri
Delft Center for Systems and Control, Delft University of Technology, Delft, The Netherlands
Peyman Mohajerin Esfahani
Peyman Mohajerin Esfahani
Unknown affiliation