🤖 AI Summary
To address the challenge of joint optimization between constraint embedding and policy learning in constrained reinforcement learning (CRL), this paper establishes a unified equivalence framework bridging CRL and feedback control. Specifically, Lagrange multiplier updates are reformulated as an optimal feedback control problem, and a multiplier-guided policy learning mechanism is introduced to enable end-to-end co-optimization. Theoretically, we show that PID-Lagrangian methods constitute only a special case within this broader framework. Methodologically, we pioneer the integration of model predictive control (MPC) into Lagrangian optimization, proposing Predictive Lagrangian Optimization (PLO)—a novel paradigm for adaptive constraint handling. Evaluated on a multi-task constrained RL benchmark, PLO significantly expands the feasible policy region (+7.2%) while preserving average reward performance, demonstrating its effectiveness, generalizability, and robustness.
📝 Abstract
Constrained optimization is widely used in reinforcement learning to address complex control tasks. From a dynamic-system perspective, iteratively solving a constrained optimization problem can be framed as the temporal evolution of a feedback control system. Classical constrained optimization methods, such as penalty and Lagrangian approaches, inherently use proportional and integral feedback controllers. In this paper, we propose a more generic equivalence framework connecting constrained optimization and feedback control systems, with the aim of developing more effective constrained RL algorithms. First, we define each step of the system evolution to determine the Lagrange multiplier by solving a multiplier feedback optimal control problem (MFOCP). In this problem, the control input is the multiplier, the state is the policy parameters, the dynamics are described by policy gradient descent, and the objective is to minimize constraint violations. We then introduce a multiplier-guided policy learning (MGPL) module to update the policy parameters. We prove that the optimal policy obtained by alternating between MFOCP and MGPL coincides with the solution of the primal constrained RL problem, thereby establishing our equivalence framework. Furthermore, we point out that the existing PID Lagrangian method is merely a special case within our framework that employs a PID controller. The framework also accommodates various other feedback controllers, facilitating the development of new algorithms. As a representative example, we employ model predictive control (MPC) as the feedback controller and propose a new algorithm called predictive Lagrangian optimization (PLO). Numerical experiments demonstrate its superiority over the PID Lagrangian method, enlarging the feasible region by up to 7.2% while achieving a comparable average reward.
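To make the controller view concrete, here is a minimal sketch contrasting the two multiplier-update rules the abstract discusses: the PID-Lagrangian update (the special case identified within the framework) and an MPC-flavored update in the spirit of PLO. The gains, the linear violation model `e_{t+1} = e_t - alpha * lam`, and the grid search are illustrative assumptions for this sketch, not the paper's actual formulation.

```python
import numpy as np

def pid_multiplier_update(violation, state, kp=1.0, ki=0.1, kd=0.5):
    """PID-Lagrangian multiplier update (the special case noted above).

    `violation` is the current constraint violation; `state` carries the
    clamped integral term and the previous violation for the derivative.
    Gains are illustrative, not tuned values from the paper.
    """
    state["integral"] = max(0.0, state["integral"] + violation)
    derivative = violation - state["prev_violation"]
    state["prev_violation"] = violation
    # Project onto lambda >= 0, as required for an inequality constraint.
    return max(0.0, kp * violation + ki * state["integral"] + kd * derivative)

def mpc_multiplier_update(violation, alpha=0.2, horizon=5,
                          candidates=np.linspace(0.0, 5.0, 101)):
    """MPC-flavored multiplier update in the spirit of PLO (a sketch only).

    Assumes a hypothetical linear violation model e_{t+1} = e_t - alpha*lam
    and picks, by grid search, the constant multiplier that minimizes
    predicted squared violation plus a small control penalty over the horizon.
    """
    best_lam, best_cost = 0.0, float("inf")
    for lam in candidates:
        e, cost = violation, 0.0
        for _ in range(horizon):
            e -= alpha * lam                  # predicted violation rollout
            cost += e ** 2 + 1e-3 * lam ** 2  # tracking + control effort
        if cost < best_cost:
            best_lam, best_cost = lam, cost
    return best_lam

# Toy usage: a persistent violation signal drives the multiplier upward,
# while the predictive update anticipates the violation's decay in advance.
state = {"integral": 0.0, "prev_violation": 0.0}
for _ in range(3):
    lam_pid = pid_multiplier_update(1.0, state)
print(lam_pid, mpc_multiplier_update(1.0), mpc_multiplier_update(0.0))
```

The key difference the sketch illustrates: the PID rule reacts to past and present violations, whereas the MPC rule optimizes the multiplier against a predicted future trajectory of the violation, which is what lets a predictive controller act before the constraint is badly violated.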