🤖 AI Summary
Reinforcement learning agents can solve diverse tasks but often violate safety requirements, and existing constrained methods struggle to ensure safety and performance jointly. To address this, we propose Constrained Trust-Region Policy Optimization (C-TRPO), which embeds safety constraints directly into the trust-region definition, so that every policy update stays within the safe set. C-TRPO is formulated within the Constrained Markov Decision Process (CMDP) framework: it reshapes the geometry of the policy space so that trust regions contain only safe policies, and from this geometric perspective it unifies TRPO, Natural Policy Gradient (NPG), and Constrained Policy Optimization (CPO). Experiments on several safety-critical RL benchmarks show that C-TRPO markedly reduces constraint violations while achieving cumulative reward competitive with state-of-the-art algorithms, yielding a principled trade-off between safety guarantees and optimality.
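The core idea, a trust region measured by a divergence that blows up at the safety boundary, can be illustrated with a toy sketch. The following is a minimal, hypothetical example (not the paper's exact algorithm): a softmax policy over a one-state CMDP (a bandit), where the usual KL trust region is augmented with an illustrative log-barrier term on the expected cost, so that admissible steps can never cross the cost budget. The barrier form, step sizes, and `beta` weight are all assumptions for illustration.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Toy one-state CMDP (a bandit): per-action reward and safety cost.
r = np.array([1.0, 2.0, 3.0])   # action rewards
c = np.array([0.0, 0.5, 2.0])   # action costs
d = 1.0                          # cost budget: require E_pi[c] <= d

def kl(p, q):
    # KL divergence between two categorical policies.
    return float(np.sum(p * np.log(p / q)))

def barrier(p, q):
    # Illustrative log-barrier term: diverges as the new policy's
    # expected cost approaches the budget d, reshaping the trust
    # region so it contains only safe policies.
    return float(np.log((d - p @ c) / (d - q @ c)))

def ctrpo_step(theta, delta=0.05, beta=1.0):
    """One C-TRPO-style update (hypothetical sketch, not the paper's method)."""
    p_old = softmax(theta)
    # Policy gradient of expected reward for a softmax policy.
    g = p_old * (r - p_old @ r)
    # Backtracking line search: accept the largest step whose reshaped
    # divergence (KL + cost barrier) stays inside the trust region.
    for step in 1.0 * 0.5 ** np.arange(20):
        p_new = softmax(theta + step * g)
        if p_new @ c < d and kl(p_old, p_new) + beta * barrier(p_old, p_new) <= delta:
            return theta + step * g
    return theta  # no admissible step found
```

Because the barrier term dominates the divergence near the constraint boundary, the accepted step shrinks as the policy approaches the budget, which is the mechanism by which the reshaped geometry keeps training safe.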
📝 Abstract
Reinforcement Learning (RL) agents can solve diverse tasks but often exhibit unsafe behavior. Constrained Markov Decision Processes (CMDPs) address this by enforcing safety constraints, yet existing methods either sacrifice reward maximization or allow unsafe training. We introduce Constrained Trust Region Policy Optimization (C-TRPO), which reshapes the policy space geometry to ensure trust regions contain only safe policies, guaranteeing constraint satisfaction throughout training. We analyze its theoretical properties and connections to TRPO, Natural Policy Gradient (NPG), and Constrained Policy Optimization (CPO). Experiments show that C-TRPO reduces constraint violations while maintaining competitive returns.