🤖 AI Summary
This study investigates the finite-sample distributional approximation of asynchronous Q-learning in high-dimensional settings. Under the assumption that the state-action-next-state sequence forms a uniformly geometrically ergodic Markov chain, the authors establish, for polynomial step sizes with Polyak–Ruppert averaging, a convergence rate of order up to \(n^{-1/6} \log^4(nSA)\) in the high-dimensional central limit theorem over the class of hyperrectangles. Concretely, they prove a high-dimensional central limit theorem for the averaged iterates with explicit Gaussian approximation error bounds, building on a new high-dimensional CLT for sums of martingale differences, and derive upper bounds on higher-order moments of the algorithm's last iterate. By integrating Markov chain ergodic theory, high-dimensional probabilistic limit theorems, and martingale difference analysis, this work lays a rigorous theoretical foundation for statistical inference in asynchronous reinforcement learning.
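To ground the terminology, here is a minimal sketch of the procedure being analyzed: asynchronous Q-learning along a single Markov trajectory, with polynomial stepsize \(k^{-\omega}\) and Polyak–Ruppert averaging of the iterates. The `env.reset()` / `env.step(s, a)` interface, the uniform behavior policy, and the default parameter values are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def async_q_learning_pr(env, n, S, A, gamma=0.99, omega=0.7, seed=0):
    """Asynchronous Q-learning with polynomial stepsize k**(-omega),
    omega in (1/2, 1], and Polyak-Ruppert averaging (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    Q = np.zeros((S, A))       # current iterate Q_k
    Q_bar = np.zeros((S, A))   # running Polyak-Ruppert average of Q_1, ..., Q_k
    s = env.reset()            # hypothetical environment interface
    for k in range(1, n + 1):
        a = int(rng.integers(A))    # behavior policy: uniform exploration (assumed)
        s_next, r = env.step(s, a)  # one transition of the underlying Markov chain
        eta = k ** (-omega)         # polynomial stepsize k^{-omega}
        # Asynchronous update: only the visited (s, a) entry is modified.
        Q[s, a] += eta * (r + gamma * Q[s_next].max() - Q[s, a])
        Q_bar += (Q - Q_bar) / k    # online Polyak-Ruppert average
        s = s_next
    return Q_bar
```

The averaged table `Q_bar` is the quantity whose Gaussian fluctuations the paper quantifies; the asynchrony refers to each step updating a single state-action entry sampled along the trajectory, rather than all \(SA\) entries synchronously.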
📝 Abstract
In this paper, we derive rates of convergence in the high-dimensional central limit theorem for Polyak–Ruppert averaged iterates generated by the asynchronous Q-learning algorithm with a polynomial stepsize $k^{-\omega}$, $\omega \in (1/2, 1]$. Assuming that the sequence of state-action-next-state triples $(s_k, a_k, s_{k+1})_{k \geq 0}$ forms a uniformly geometrically ergodic Markov chain, we establish a rate of order up to $n^{-1/6} \log^{4}(nSA)$ over the class of hyperrectangles, where $n$ is the number of samples used by the algorithm and $S$ and $A$ denote the numbers of states and actions, respectively. To obtain this result, we prove a high-dimensional central limit theorem for sums of martingale differences, which may be of independent interest. Finally, we present bounds for high-order moments of the algorithm's last iterate.
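Schematically, with the exact conditions, constants, and covariance structure as specified in the paper, a Gaussian approximation bound of this type over hyperrectangles can be read as

$$
\sup_{R \in \mathcal{R}} \Bigl| \mathbb{P}\bigl(\sqrt{n}\,(\bar{Q}_n - Q^{\star}) \in R\bigr) - \mathbb{P}(Z \in R) \Bigr| \;\lesssim\; n^{-1/6} \log^{4}(nSA),
$$

where $\bar{Q}_n$ denotes the Polyak–Ruppert averaged iterate, $Q^{\star}$ the optimal Q-function, $\mathcal{R}$ the class of hyperrectangles in $\mathbb{R}^{SA}$, and $Z$ a centered Gaussian vector with the corresponding asymptotic covariance. The notation $\bar{Q}_n$, $Q^{\star}$, $Z$, $\mathcal{R}$ is ours for illustration; the precise statement and normalization are given in the paper.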