🤖 AI Summary
This work studies the sample complexity of off-policy actor-critic algorithms in a single-loop, single-timescale setting. Under a minimal assumption (the existence of a policy that induces an irreducible Markov chain, with no uniform mixing or uniform exploration conditions), it introduces a coupled Lyapunov drift framework that uses a cross-domination property to characterize the joint dynamics of the actor and critic. The paper establishes the first $\tilde{\mathcal{O}}(\varepsilon^{-2})$ sample complexity guarantee for finding an $\varepsilon$-optimal policy in this setting, based on a geometric convergence rate for the actor and an $\tilde{\mathcal{O}}(1/T)$ convergence rate for the critic. These results relax the strong exploration and mixing assumptions common in existing theoretical analyses.
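To make the "single-loop, single-timescale" setting concrete, the following is a minimal tabular sketch, not the paper's algorithm. It assumes a small random MDP, a uniform behavior policy, an expected-value TD update for the critic, and a natural-policy-gradient-style softmax update for the actor; the names `random_mdp`, `single_loop_actor_critic`, `alpha`, and `beta` are hypothetical and chosen only for illustration.

```python
import math
import random


def softmax(row):
    """Numerically stable softmax over one row of logits."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def random_mdp(num_states=4, num_actions=2, seed=0):
    """Hypothetical toy MDP: dense transitions (so the chain under a
    uniform behavior policy is irreducible) and rewards in [0, 1)."""
    rng = random.Random(seed)
    P, R = [], []
    for _ in range(num_states):
        p_s, r_s = [], []
        for _ in range(num_actions):
            w = [rng.random() + 0.1 for _ in range(num_states)]
            total = sum(w)
            p_s.append([x / total for x in w])
            r_s.append(rng.random())
        P.append(p_s)
        R.append(r_s)
    return P, R


def single_loop_actor_critic(P, R, gamma=0.9, steps=5000,
                             alpha=0.1, beta=0.01, seed=0):
    """One sample, one critic step, and one actor step per iteration.

    Assumption: constant step sizes alpha and beta (same order of
    magnitude), which is one way to read "single timescale".
    """
    rng = random.Random(seed)
    S, A = len(R), len(R[0])
    Q = [[0.0] * A for _ in range(S)]      # critic estimate
    theta = [[0.0] * A for _ in range(S)]  # actor logits
    s = 0
    for _ in range(steps):
        # Off-policy data: actions come from a uniform behavior policy.
        a = rng.randrange(A)
        s_next = rng.choices(range(S), weights=P[s][a])[0]

        # Critic: TD step toward the value of the current target policy.
        pi_next = softmax(theta[s_next])
        v_next = sum(p * q for p, q in zip(pi_next, Q[s_next]))
        Q[s][a] += alpha * (R[s][a] + gamma * v_next - Q[s][a])

        # Actor: NPG-style update, pi(a|s) proportional to
        # pi(a|s) * exp(beta * Q(s, a)), using the current critic.
        for i in range(S):
            for j in range(A):
                theta[i][j] += beta * Q[i][j]

        s = s_next
    return [softmax(row) for row in theta], Q


if __name__ == "__main__":
    P, R = random_mdp()
    policy, Q = single_loop_actor_critic(P, R)
    for state, row in enumerate(policy):
        print(state, [round(p, 3) for p in row])
```

There is no inner loop that runs the critic to convergence before each actor step: both iterates move on every sample, which is what couples the two update equations. In this tabular toy the critic stays bounded, so it does not exhibit the unbounded iterates that the paper's analysis is designed to handle.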
📝 Abstract
In this paper, we establish last-iterate convergence rates for off-policy actor--critic methods in reinforcement learning. In particular, under a single-loop, single-timescale implementation and a broad class of policy updates, including approximate policy iteration and natural policy gradient methods, we prove the first $\tilde{\mathcal{O}}(\varepsilon^{-2})$ sample complexity guarantee for finding an $\varepsilon$-optimal policy under minimal assumptions, namely, the existence of a policy that induces an irreducible Markov chain. This stands in stark contrast to the existing literature, where an $\tilde{\mathcal{O}}(\varepsilon^{-2})$ sample complexity is achieved only through nested-loop updates and/or under strong, algorithm-dependent assumptions on the policies, such as uniform mixing and uniform exploration. Technically, to address the challenges posed by the coupled update equations arising from the single-loop implementation, as well as the potentially unbounded iterates induced by off-policy learning, our analysis is based on a coupled Lyapunov drift framework. Specifically, we establish a geometric convergence rate for the actor and an $\tilde{\mathcal{O}}(1/T)$ convergence rate for the critic, and combine the two Lyapunov drift inequalities through a cross-domination property. We believe this analytical framework is of independent interest and may be applicable to other coupled iterative algorithms with unbounded