Finite-Time Analysis of Three-Timescale Constrained Actor-Critic and Constrained Natural Actor-Critic Algorithms

πŸ“… 2023-10-25
πŸ›οΈ Conference on Uncertainty in Artificial Intelligence
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
This paper studies constrained Markov decision processes (C-MDPs) with inequality constraints under non-i.i.d. (Markovian) sampling. For the function approximation setting, we propose two three-timescale algorithms: Constrained Actor-Critic (C-AC) and Constrained Natural Actor-Critic (C-NAC). We establish, for the first time, non-asymptotic convergence guarantees for constrained AC/NAC-type algorithms under Markovian sampling, rigorously proving that both algorithms converge to a first-order stationary point of the Lagrangian with sample complexity $\tilde{\mathcal{O}}(\varepsilon^{-2.5})$. Our analysis unifies Lagrangian duality, natural policy gradients, Markovian noise control, and multi-timescale stochastic approximation. Experiments on the Safety-Gym benchmark demonstrate the algorithms' effectiveness in satisfying constraints while maintaining stable policy performance.
πŸ“ Abstract
Actor Critic methods have found immense applications on a wide range of Reinforcement Learning tasks especially when the state-action space is large. In this paper, we consider actor critic and natural actor critic algorithms with function approximation for constrained Markov decision processes (C-MDP) involving inequality constraints and carry out a non-asymptotic analysis for both of these algorithms in a non-i.i.d. (Markovian) setting. We consider the long-run average cost criterion where both the objective and the constraint functions are suitable policy-dependent long-run averages of certain prescribed cost functions. We handle the inequality constraints using the Lagrange multiplier method. We prove that these algorithms are guaranteed to find a first-order stationary point (i.e., $\Vert \nabla L(\theta,\gamma)\Vert_2^2 \leq \epsilon$) of the performance (Lagrange) function $L(\theta,\gamma)$, with a sample complexity of $\tilde{\mathcal{O}}(\epsilon^{-2.5})$ in the case of both Constrained Actor Critic (C-AC) and Constrained Natural Actor Critic (C-NAC) algorithms. We also show the results of experiments on three different Safety-Gym environments.
Problem

Research questions and friction points this paper is trying to address.

Finite-time analysis of constrained actor-critic algorithms
Solving constrained Markov decision processes with inequality constraints
Achieving first-order stationary points with proven sample complexity
Innovation

Methods, ideas, or system contributions that make the work stand out.

Three-timescale constrained actor-critic algorithms
Lagrange multiplier method for inequality constraints
Non-asymptotic analysis with Markovian sampling
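The three-timescale structure listed above can be illustrated with a minimal sketch: the critic updates on the fastest timescale, the actor on an intermediate one, and the Lagrange multiplier (handling the inequality constraint) on the slowest. All names, the toy MDP, and the step-size exponents below are illustrative assumptions, not the paper's exact algorithm or schedules.

```python
# Hypothetical sketch of a three-timescale constrained actor-critic loop
# (average-cost setting, tabular toy MDP). Step-size exponents are chosen
# only to satisfy the usual timescale-separation ordering; they are not
# taken from the paper.
import numpy as np

rng = np.random.default_rng(0)

n_states, n_actions = 2, 2
theta = np.zeros((n_states, n_actions))  # actor (policy) parameters
v = np.zeros(n_states)                   # critic value estimates
gamma_mult = 0.0                         # Lagrange multiplier, projected to >= 0
constraint_budget = 0.5                  # average constraint-cost level

def policy(s):
    """Softmax policy over actions in state s."""
    p = np.exp(theta[s] - theta[s].max())
    return p / p.sum()

def step(s, a):
    """Toy Markovian dynamics: action 0 tends to stay, action 1 to switch."""
    s2 = s if rng.random() < 0.7 + 0.2 * (a == 0) else 1 - s
    cost = float(s2)    # objective cost
    ccost = float(a)    # constrained cost (penalizes action 1)
    return s2, cost, ccost

s = 0
avg_cost = avg_ccost = 0.0
for t in range(1, 20001):
    # three diminishing step sizes: critic fastest, actor slower,
    # Lagrange multiplier slowest (timescale separation)
    a_t = 1.0 / t ** 0.55   # critic
    b_t = 1.0 / t ** 0.75   # actor
    c_t = 1.0 / t ** 0.95   # multiplier

    pi = policy(s)
    a = rng.choice(n_actions, p=pi)
    s2, cost, ccost = step(s, a)

    # Lagrangian one-step cost and running average-cost estimates
    lag_cost = cost + gamma_mult * ccost
    avg_cost += a_t * (lag_cost - avg_cost)
    avg_ccost += c_t * (ccost - avg_ccost)

    # critic: average-cost TD(0) update
    delta = lag_cost - avg_cost + v[s2] - v[s]
    v[s] += a_t * delta

    # actor: policy-gradient descent step using the TD error
    grad_log = -pi
    grad_log[a] += 1.0
    theta[s] -= b_t * delta * grad_log

    # multiplier: projected ascent on the constraint violation
    gamma_mult = max(0.0, gamma_mult + c_t * (avg_ccost - constraint_budget))

    s = s2
```

A C-NAC variant would replace the plain gradient step on `theta` with a natural-gradient step (preconditioning by an estimate of the Fisher information), keeping the same three-timescale schedule.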
Prashansa Panda
Department of Computer Science and Automation, Indian Institute of Science, Bangalore, India
Shalabh Bhatnagar
Professor in the Department of Computer Science and Automation, Indian Institute of Science
Stochastic systems, control, simulation, optimization