🤖 AI Summary
This paper addresses the long-run average-cost reinforcement learning problem under inequality constraints. It proposes, for the first time, a natural critic-actor algorithm based on function approximation and establishes finite-time convergence guarantees, filling a critical gap: prior work lacked non-asymptotic guarantees under both the average-cost criterion and constraint satisfaction. The method integrates two-timescale stochastic approximation, linear function approximation, and projected gradient-based constrained optimization to jointly update policy and value-function estimates. The authors derive theoretically grounded optimal step-size schedules and propose a modification that improves the sample-complexity bound. Empirical evaluation on Safety-Gym environments demonstrates performance competitive with state-of-the-art constrained RL algorithms. The core contribution is the first non-asymptotic convergence analysis for natural critic-actor methods in the constrained average-cost setting, combining rigorous theoretical foundations with practical implementability.
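The ingredients named above (two-timescale updates, a linear critic, a Lagrangian treatment of the constraint with a projected multiplier step) can be sketched in a minimal toy loop. This is an illustrative assumption-laden sketch, not the paper's algorithm: the MDP, the step-size exponents, and the use of a plain softmax score function in place of a true natural-gradient step are all invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 2-state, 2-action MDP (invented for illustration)
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.3, 0.7], [0.6, 0.4]]])     # P[s, a, s']
cost = np.array([[1.0, 0.5], [0.2, 0.8]])    # objective cost c(s, a)
g = np.array([[0.0, 1.0], [1.0, 0.0]])       # constraint cost g(s, a)
budget = 0.6                                 # require long-run avg g <= budget

n_states, n_actions = 2, 2
feat = np.eye(n_states)                      # linear critic features phi(s)
v = np.zeros(n_states)                       # critic weights
theta = np.zeros((n_states, n_actions))      # softmax policy parameters
lam, rho = 0.0, 0.0                          # Lagrange multiplier, avg-cost estimate

def policy(s):
    z = np.exp(theta[s] - theta[s].max())
    return z / z.sum()

s = 0
for t in range(1, 20001):
    a_critic = 1.0 / t ** 0.55   # slower critic step ("reversed" timescales)
    a_actor = 1.0 / t ** 0.40    # faster actor step -- illustrative schedules
    pi = policy(s)
    a = rng.choice(n_actions, p=pi)
    s2 = rng.choice(n_states, p=P[s, a])
    lagr = cost[s, a] + lam * (g[s, a] - budget)       # Lagrangian cost
    delta = lagr - rho + feat[s2] @ v - feat[s] @ v    # average-cost TD error
    rho += a_critic * (lagr - rho)
    v += a_critic * delta * feat[s]
    # actor descent step (plain score function here, not a natural gradient)
    grad_log = -pi.copy()
    grad_log[a] += 1.0
    theta[s] -= a_actor * delta * grad_log
    # projected dual ascent on the multiplier (projection onto lam >= 0)
    lam = max(0.0, lam + (1.0 / t ** 0.8) * (g[s, a] - budget))
    s = s2
```

The two step-size sequences decay at different rates so that one set of iterates effectively sees the other as quasi-static, which is the essence of two-timescale stochastic approximation; the `max(0.0, ...)` is the projection that keeps the dual variable feasible.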
📝 Abstract
Recent studies have increasingly focused on non-asymptotic convergence analyses for actor-critic (AC) algorithms. One such effort introduced a two-timescale critic-actor algorithm for the discounted cost setting using a tabular representation, where the usual roles of the actor and critic are reversed. However, only asymptotic convergence was established there. Subsequently, both asymptotic and non-asymptotic analyses of the critic-actor algorithm with linear function approximation were conducted. In our work, we introduce the first natural critic-actor algorithm with function approximation for the long-run average-cost setting under inequality constraints. We provide non-asymptotic convergence guarantees for this algorithm. Our analysis establishes optimal learning rates, and we also propose a modification to enhance sample complexity. We further present the results of experiments on three different Safety-Gym environments, where our algorithm is found to be competitive with other well-known algorithms.