Achieving $ε^{-2}$ Sample Complexity for Single-Loop Actor-Critic under Minimal Assumptions

📅 2026-05-13

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work studies the sample complexity of off-policy actor-critic algorithms in a single-loop, single-timescale setting. The only assumption is that some policy induces an irreducible Markov chain; no uniform mixing or uniform exploration condition is imposed. To characterize the joint dynamics of the actor and critic, the paper introduces a coupled Lyapunov drift framework that combines the two drift inequalities through a cross-domination property. It establishes the first $\tilde{\mathcal{O}}(\varepsilon^{-2})$ sample complexity guarantee for finding an $\varepsilon$-optimal policy in this setting, with a geometric convergence rate for the actor and an $\tilde{\mathcal{O}}(1/T)$ convergence rate for the critic. These results relax the exploration and mixing assumptions common in existing theoretical analyses.
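As a rough illustration of how an $\tilde{\mathcal{O}}(1/T)$ critic rate can translate into an $\tilde{\mathcal{O}}(\varepsilon^{-2})$ sample count, the arithmetic below assumes that the rate bounds a squared error $E_T$ and that each iteration consumes one sample. Both are assumptions made for this sketch, and the constant $C$ is hypothetical; the summary does not state the exact error metric.

```latex
% Assumption (illustrative): E_T is the critic's mean-squared error after T
% iterations and C > 0 is a hypothetical problem-dependent constant.
E_T \;\le\; \frac{C \log T}{T}
\quad\Longrightarrow\quad
E_T \le \varepsilon^{2}
\ \text{ whenever }\
\frac{T}{\log T} \;\ge\; C\,\varepsilon^{-2},
\quad\text{i.e.}\quad
T = \tilde{\mathcal{O}}(\varepsilon^{-2}).
```

Under the same reading, a geometrically decaying actor term of the form $\rho^{T}$ with $\rho \in (0,1)$ drops below $\varepsilon$ after only $\mathcal{O}(\log(1/\varepsilon))$ iterations, so the critic's rate would determine the overall sample count.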

📝 Abstract

In this paper, we establish last-iterate convergence rates for off-policy actor-critic methods in reinforcement learning. In particular, under a single-loop, single-timescale implementation and a broad class of policy updates, including approximate policy iteration and natural policy gradient methods, we prove the first $\tilde{\mathcal{O}}(ε^{-2})$ sample complexity guarantee for finding an $ε$-optimal policy under minimal assumptions, namely, the existence of a policy that induces an irreducible Markov chain. This stands in stark contrast to the existing literature, where an $\tilde{\mathcal{O}}(ε^{-2})$ sample complexity is achieved only through nested-loop updates and/or under strong, algorithm-dependent assumptions on the policies, such as uniform mixing and uniform exploration. Technically, to address the challenges posed by the coupled update equations arising from the single-loop implementation, as well as the potentially unbounded iterates induced by off-policy learning, our analysis is based on a coupled Lyapunov drift framework. Specifically, we establish a geometric convergence rate for the actor and an $\tilde{\mathcal{O}}(1/T)$ convergence rate for the critic, and combine the two Lyapunov drift inequalities through a cross-domination property. We believe this analytical framework is of independent interest and may be applicable to other coupled iterative algorithms with unbounded
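To make "single-loop, single-timescale" concrete, here is a minimal tabular sketch in Python, not the paper's algorithm. It assumes a fixed uniform behavior policy, an expected-TD critic update, a softmax actor with a natural-policy-gradient-style step (logits incremented by the current critic), and constant step sizes of the same order. The function name, the step sizes, and the two-state MDP are all hypothetical choices made for illustration.

```python
import numpy as np


def softmax(logits):
    """Row-wise softmax, shifted by the row maximum for numerical stability."""
    z = logits - logits.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)


def single_loop_actor_critic(P, R, gamma=0.9, T=5000, alpha=0.1, beta=0.1, seed=0):
    """Illustrative single-loop off-policy actor-critic on a tabular MDP.

    P: transition probabilities, shape (S, A, S).
    R: rewards, shape (S, A).
    Each iteration draws ONE sample and updates critic and actor once,
    with step sizes alpha and beta of the same order (single timescale).
    """
    rng = np.random.default_rng(seed)
    S, A = R.shape
    Q = np.zeros((S, A))      # critic estimate
    theta = np.zeros((S, A))  # actor logits (softmax target policy)
    s = 0
    for _ in range(T):
        pi = softmax(theta)
        # Behavior policy (assumption): uniform over actions, never updated.
        a = rng.integers(A)
        s_next = rng.choice(S, p=P[s, a])
        # Critic: off-policy expected-TD step toward the target policy pi.
        target = R[s, a] + gamma * (pi[s_next] @ Q[s_next])
        Q[s, a] += alpha * (target - Q[s, a])
        # Actor: NPG-style step for tabular softmax using the current critic.
        theta += beta * Q
        s = s_next
    return softmax(theta), Q


# Hypothetical two-state MDP: action 0 tends to stay, action 1 tends to switch;
# reward 1 is earned in state 1, so switching out of state 0 and staying in
# state 1 is the better behavior.
P = np.array([
    [[0.9, 0.1], [0.1, 0.9]],  # from state 0: action 0, action 1
    [[0.1, 0.9], [0.9, 0.1]],  # from state 1: action 0, action 1
])
R = np.array([[0.0, 0.0], [1.0, 1.0]])

pi, Q = single_loop_actor_critic(P, R)
print(pi)
```

There is no inner loop that runs the critic to convergence before each actor step, which is the structural difference from nested-loop schemes. The logits here grow without bound as iterations accumulate, a small-scale echo of the unbounded iterates the abstract mentions.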

Problem

Research questions and friction points this paper is trying to address.

sample complexity

actor-critic

reinforcement learning

single-loop

off-policy

Innovation

Methods, ideas, or system contributions that make the work stand out.

single-loop actor-critic

sample complexity

off-policy reinforcement learning