🤖 AI Summary
This work addresses the challenge of computing deterministic optimal policies for constrained Markov decision processes (CMDPs) with continuous state-action spaces. Existing policy gradient methods struggle in this setting because they rely on stochastic policies and enumerable state-action pairs, making them ill-suited to continuous domains with hard constraints. To overcome this, the authors propose the first provably convergent deterministic policy gradient primal-dual algorithm (D-PGPD). D-PGPD handles constraints through a quadratically regularized Lagrangian: the deterministic policy is updated via a quadratic-regularized gradient ascent step and the dual variable via a quadratic-regularized gradient descent step. The primal-dual iterates are shown to converge sublinearly to the regularized optimal primal-dual pair, and the analysis extends to function approximation with explicit bounds on the induced approximation error. Empirical evaluation on robot navigation and fluid control tasks demonstrates the method's effectiveness.
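As a rough sketch of the idea (the notation below is illustrative, not taken from the paper), a quadratically regularized Lagrangian for a constrained MDP with constraint $V_g(\pi) \ge b$ can be written as:

```latex
% Sketch of a quadratically regularized Lagrangian (illustrative notation):
% V_r(\pi): expected return, V_g(\pi): expected constraint value,
% b: constraint threshold, \lambda \ge 0: dual variable,
% \tau, \nu > 0: regularization weights.
\[
  L_{\tau,\nu}(\pi,\lambda)
  = V_r(\pi) + \lambda\bigl(V_g(\pi)-b\bigr)
    - \tfrac{\tau}{2}\,\lVert\pi\rVert^2
    - \tfrac{\nu}{2}\,\lambda^2 ,
\]
% with each iteration taking a gradient ascent step on \pi and a
% projected gradient descent step on \lambda \ge 0, both applied
% to L_{\tau,\nu}.
```

The quadratic terms keep the policy update well-conditioned and the dual variable bounded, which is what enables the non-asymptotic convergence analysis.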
📝 Abstract
We study the problem of computing deterministic optimal policies for constrained Markov decision processes (MDPs) with continuous state and action spaces, which are widely encountered in constrained dynamical systems. Designing deterministic policy gradient methods in continuous state and action spaces is particularly challenging due to the lack of enumerable state-action pairs and the adoption of deterministic policies, which hinders the application of existing policy gradient methods. To this end, we develop a deterministic policy gradient primal-dual method that finds an optimal deterministic policy with non-asymptotic convergence guarantees. Specifically, we leverage regularization of the Lagrangian of the constrained MDP to propose a deterministic policy gradient primal-dual (D-PGPD) algorithm that updates the deterministic policy via a quadratic-regularized gradient ascent step and the dual variable via a quadratic-regularized gradient descent step. We prove that the primal-dual iterates of D-PGPD converge at a sub-linear rate to an optimal regularized primal-dual pair. We further instantiate D-PGPD with function approximation and prove that the same sub-linear convergence holds up to a function approximation error. Finally, we demonstrate the effectiveness of our method on two continuous control problems: robot navigation and fluid control. This appears to be the first work to propose a deterministic policy search method for continuous-space constrained MDPs.
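The alternating regularized ascent/descent scheme described above can be illustrated on a toy scalar problem. This is a generic quadratic-regularized primal-dual sketch, not the paper's exact D-PGPD algorithm; the reward, constraint, step size, and regularization weights below are all hypothetical choices for illustration:

```python
# Generic quadratic-regularized primal-dual sketch (illustrative, not the
# paper's exact D-PGPD): maximize a toy reward over a scalar deterministic
# policy parameter theta subject to the constraint v_g(theta) >= b,
# alternating a regularized ascent step on theta with a regularized,
# projected descent step on the dual variable lam.

def v_r(theta):
    """Toy reward: peaks at theta = 2."""
    return -(theta - 2.0) ** 2

def v_g(theta):
    """Toy constraint value; we require v_g(theta) >= b."""
    return theta

def primal_dual_sketch(steps=2000, eta=0.05, tau=0.01, nu=0.01, b=3.0):
    theta, lam = 0.0, 0.0
    for _ in range(steps):
        # Gradient of the regularized Lagrangian w.r.t. theta:
        # d/dtheta [ v_r(theta) + lam * (v_g(theta) - b) - (tau/2) theta^2 ]
        grad_theta = -2.0 * (theta - 2.0) + lam - tau * theta
        theta += eta * grad_theta                 # regularized ascent step
        # Gradient w.r.t. lam of the same Lagrangian (with -(nu/2) lam^2),
        # followed by projection onto the feasible dual set lam >= 0.
        grad_lam = (v_g(theta) - b) - nu * lam
        lam = max(0.0, lam - eta * grad_lam)      # regularized descent step
    return theta, lam

theta, lam = primal_dual_sketch()
```

With these hypothetical parameters the iterates settle at the regularized saddle point (theta near 3.02, lam near 2.07): the constraint theta >= 3 is active, so the dual variable stays strictly positive, while the quadratic terms slightly shift the solution away from the unregularized one, mirroring the "optimal regularized primal-dual pair" the paper's guarantees target.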