Finite-Time Analysis of Three-Timescale Constrained Actor-Critic and Constrained Natural Actor-Critic Algorithms

πŸ“… 2023-10-25
πŸ›οΈ Conference on Uncertainty in Artificial Intelligence
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
This paper studies constrained Markov decision processes (C-MDPs) with inequality constraints under non-i.i.d. (Markovian) sampling. For the function approximation setting, we propose two three-timescale algorithms: Constrained Actor-Critic (C-AC) and Constrained Natural Actor-Critic (C-NAC). We establish, for the first time, non-asymptotic convergence guarantees for constrained AC/NAC-type algorithms under Markovian sampling, rigorously proving that both algorithms converge to a first-order stationary point of the Lagrangian with sample complexity $\tilde{\mathcal{O}}(\varepsilon^{-2.5})$. Our analysis unifies Lagrangian duality, natural policy gradients, Markovian noise control, and multi-timescale stochastic approximation. Experiments on the Safety-Gym benchmark demonstrate the algorithms' effectiveness in satisfying constraints while maintaining stable policy performance.
πŸ“ Abstract
Actor Critic methods have found immense applications on a wide range of Reinforcement Learning tasks especially when the state-action space is large. In this paper, we consider actor critic and natural actor critic algorithms with function approximation for constrained Markov decision processes (C-MDP) involving inequality constraints and carry out a non-asymptotic analysis for both of these algorithms in a non-i.i.d. (Markovian) setting. We consider the long-run average cost criterion where both the objective and the constraint functions are suitable policy-dependent long-run averages of certain prescribed cost functions. We handle the inequality constraints using the Lagrange multiplier method. We prove that these algorithms are guaranteed to find a first-order stationary point (i.e., $\Vert \nabla L(\theta,\gamma)\Vert_2^2 \leq \epsilon$) of the performance (Lagrange) function $L(\theta,\gamma)$, with a sample complexity of $\tilde{\mathcal{O}}(\epsilon^{-2.5})$ in the case of both Constrained Actor Critic (C-AC) and Constrained Natural Actor Critic (C-NAC) algorithms. We also show the results of experiments on three different Safety-Gym environments.
Problem

Research questions and friction points this paper is trying to address.

Finite-time analysis of constrained actor-critic algorithms
Solving constrained Markov decision processes with inequality constraints
Achieving first-order stationary points with proven sample complexity
Innovation

Methods, ideas, or system contributions that make the work stand out.

Three-timescale constrained actor-critic algorithms
Lagrange multiplier method for inequality constraints
Non-asymptotic analysis with Markovian sampling
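The three-timescale structure listed above can be illustrated with a minimal sketch: the critic updates on the fastest timescale, the actor on an intermediate one, and the Lagrange multiplier (handling the inequality constraint) on the slowest. All names, the toy MDP, and the step-size exponents below are illustrative assumptions, not the paper's exact algorithm or schedules.

```python
# Hypothetical sketch of a three-timescale constrained actor-critic loop
# (average-cost setting, tabular toy MDP). Step-size exponents are chosen
# only to satisfy the usual timescale-separation ordering; they are not
# taken from the paper.
import numpy as np

rng = np.random.default_rng(0)

n_states, n_actions = 2, 2
theta = np.zeros((n_states, n_actions))  # actor (policy) parameters
v = np.zeros(n_states)                   # critic value estimates
gamma_mult = 0.0                         # Lagrange multiplier, projected to >= 0
constraint_budget = 0.5                  # average constraint-cost level

def policy(s):
    """Softmax policy over actions in state s."""
    p = np.exp(theta[s] - theta[s].max())
    return p / p.sum()

def step(s, a):
    """Toy Markovian dynamics: action 0 tends to stay, action 1 to switch."""
    s2 = s if rng.random() < 0.7 + 0.2 * (a == 0) else 1 - s
    cost = float(s2)    # objective cost
    ccost = float(a)    # constrained cost (penalizes action 1)
    return s2, cost, ccost

s = 0
avg_cost = avg_ccost = 0.0
for t in range(1, 20001):
    # three diminishing step sizes: critic fastest, actor slower,
    # Lagrange multiplier slowest (timescale separation)
    a_t = 1.0 / t ** 0.55   # critic
    b_t = 1.0 / t ** 0.75   # actor
    c_t = 1.0 / t ** 0.95   # multiplier

    pi = policy(s)
    a = rng.choice(n_actions, p=pi)
    s2, cost, ccost = step(s, a)

    # Lagrangian one-step cost and running average-cost estimates
    lag_cost = cost + gamma_mult * ccost
    avg_cost += a_t * (lag_cost - avg_cost)
    avg_ccost += c_t * (ccost - avg_ccost)

    # critic: average-cost TD(0) update
    delta = lag_cost - avg_cost + v[s2] - v[s]
    v[s] += a_t * delta

    # actor: policy-gradient descent step using the TD error
    grad_log = -pi
    grad_log[a] += 1.0
    theta[s] -= b_t * delta * grad_log

    # multiplier: projected ascent on the constraint violation
    gamma_mult = max(0.0, gamma_mult + c_t * (avg_ccost - constraint_budget))

    s = s2
```

A C-NAC variant would replace the plain gradient step on `theta` with a natural-gradient step (preconditioning by an estimate of the Fisher information), keeping the same three-timescale schedule.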
Prashansa Panda
Department of Computer Science and Automation, Indian Institute of Science, Bangalore, India
Shalabh Bhatnagar
Professor in the Department of Computer Science and Automation, Indian Institute of Science
Stochastic systems, control, simulation, optimization