A Minimal-Assumption Analysis of Q-Learning with Time-Varying Policies

📅 2025-10-17
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work studies the finite-time convergence of Q-learning under on-policy sampling, i.e., with a time-varying learning policy, assuming only the existence of a policy that induces an irreducible Markov chain over the state space, the weakest ergodicity condition used to date. To handle the rapidly time-inhomogeneous Markovian noise that such policies induce, we establish, for the first time, an explicit last-iterate convergence rate for Q-learning under time-varying policies. Our analysis introduces a unified framework combining a Poisson equation decomposition of the noise with a sensitivity analysis of the associated lazy transition matrices. We prove an $\mathcal{O}(1/\varepsilon^2)$ sample complexity and derive explicit convergence rates for both the Q-function estimates and the policy sequence. Numerical experiments corroborate the theory and illustrate the exploration-exploitation trade-off relative to off-policy Q-learning. The proposed analytical tools extend naturally to general single-timescale stochastic approximation algorithms.
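A minimal sketch of the algorithmic setting being analyzed: tabular Q-learning updated along a single trajectory whose behavior policy is derived from the current Q iterate (here epsilon-greedy). The environment interface, step-size schedule, and epsilon are illustrative assumptions, not the paper's exact specification.

```python
import numpy as np

def on_policy_q_learning(P, R, gamma=0.9, eps=0.2, n_steps=100_000, seed=0):
    """Tabular Q-learning driven by a time-varying (epsilon-greedy) learning policy.

    P: transition probabilities, shape (S, A, S); R: rewards, shape (S, A).
    Because the behavior policy changes with the current Q estimate, the sampled
    state-action chain is time-inhomogeneous, which is the regime analyzed here.
    """
    rng = np.random.default_rng(seed)
    S, A = R.shape
    Q = np.zeros((S, A))
    s = rng.integers(S)
    for k in range(n_steps):
        # Time-varying learning policy pi_k: epsilon-greedy w.r.t. the current Q_k.
        a = rng.integers(A) if rng.random() < eps else int(np.argmax(Q[s]))
        s_next = rng.choice(S, p=P[s, a])
        # Asynchronous update of the visited (s, a) pair only.
        alpha = 1.0 / (1 + (k + 1) ** 0.8)  # illustrative step-size schedule
        target = R[s, a] + gamma * np.max(Q[s_next])
        Q[s, a] += alpha * (target - Q[s, a])
        s = s_next
    return Q
```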

📝 Abstract
In this work, we present the first finite-time analysis of the Q-learning algorithm under time-varying learning policies (i.e., on-policy sampling) with minimal assumptions: specifically, assuming only the existence of a policy that induces an irreducible Markov chain over the state space. We establish a last-iterate convergence rate for $\mathbb{E}[\|Q_k - Q^*\|_\infty^2]$, implying a sample complexity of order $O(1/\varepsilon^2)$ for achieving $\mathbb{E}[\|Q_k - Q^*\|_\infty] \le \varepsilon$, matching that of off-policy Q-learning but with a worse dependence on exploration-related parameters. We also derive an explicit rate for $\mathbb{E}[\|Q^{\pi_k} - Q^*\|_\infty^2]$, where $\pi_k$ is the learning policy at iteration $k$. These results reveal that on-policy Q-learning exhibits weaker exploration than its off-policy counterpart but enjoys an exploitation advantage, as its policy converges to an optimal one rather than remaining fixed. Numerical simulations corroborate our theory. Technically, the combination of time-varying learning policies (which induce rapidly time-inhomogeneous Markovian noise) and the minimal assumption on exploration presents significant analytical challenges. To address these challenges, we employ a refined approach that leverages the Poisson equation to decompose the Markovian noise corresponding to the lazy transition matrix into a martingale-difference term and residual terms. To control the residual terms under time inhomogeneity, we perform a sensitivity analysis of the Poisson equation solution with respect to both the Q-function estimate and the learning policy. These tools may further facilitate the analysis of general reinforcement learning algorithms with rapidly time-varying learning policies, such as single-timescale actor-critic methods and learning-in-games algorithms, and are of independent interest.
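A schematic of the Poisson-equation decomposition described above, written in illustrative notation rather than the paper's own: $x_k$ stands for the current Q-estimate/policy pair, $Y_k$ for the sampled state-action pair, $P_x$ for the (lazy) transition kernel induced by $x$, and $\bar F(x)$ for the stationary average of the noise function $F$.

```latex
% Let V_x solve the Poisson equation  V_x(y) - (P_x V_x)(y) = F(x, y) - \bar F(x).
% Adding and subtracting V_{x_k}(Y_{k+1}) splits the Markovian noise at step k as
\[
F(x_k, Y_k) - \bar F(x_k)
  = \underbrace{V_{x_k}(Y_{k+1}) - (P_{x_k} V_{x_k})(Y_k)}_{\text{martingale difference}}
  + \underbrace{V_{x_k}(Y_k) - V_{x_k}(Y_{k+1})}_{\text{residual (telescopes across } k)} .
\]
```

Because the learning policy, and hence $P_{x_k}$ and $V_{x_k}$, changes at every iteration, the telescoped residuals do not cancel exactly; controlling them is where the sensitivity analysis of the Poisson equation solution comes in.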
Problem

Research questions and friction points this paper is trying to address.

Analyzing Q-learning convergence with time-varying policies under minimal assumptions
Establishing finite-time convergence rates for on-policy Q-learning algorithms
Addressing analytical challenges of rapidly time-inhomogeneous Markovian noise
Innovation

Methods, ideas, or system contributions that make the work stand out.

Analyzed Q-learning under time-varying (on-policy) learning policies with only a minimal exploration assumption
Decomposed the time-inhomogeneous Markovian noise via the Poisson equation into a martingale-difference term and residual terms
Performed a sensitivity analysis of the Poisson equation solution in the Q-estimate and the learning policy to control the residual terms (see the sketch below)
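A sketch of the kind of Lipschitz-type sensitivity bound such a control argument typically needs; the constants $L_Q$, $L_\pi$ and the exact norms are illustrative placeholders, not the paper's statement.

```latex
% V_{Q,\pi} denotes the Poisson equation solution associated with the lazy
% transition kernel induced by the Q-estimate Q and learning policy \pi.
\[
\bigl\| V_{Q,\pi} - V_{Q',\pi'} \bigr\|_\infty
  \;\lesssim\; L_Q \,\bigl\| Q - Q' \bigr\|_\infty
  \;+\; L_\pi \,\max_{s} \bigl\| \pi(\cdot \mid s) - \pi'(\cdot \mid s) \bigr\|_1
\]
```

Since consecutive iterates differ only by the order of the step size, a bound of this form turns the residual terms left over from the Poisson decomposition into perturbations small enough to be summed.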