Central Path Proximal Policy Optimization

📅 2025-05-31
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the challenge of simultaneously satisfying constraints and maximizing return in constrained Markov decision processes (CMDPs), this paper proposes a proximal policy optimization (PPO) variant grounded in the central-path concept from interior-point optimization. Its core innovation is the first adaptation of the central-path paradigm to the policy space: barrier functions are embedded in the policy geometry and combined with KL-divergence regularization and adaptive Lagrange multipliers to approximately track the central path. This formulation sidesteps the conventional trade-off between performance and safety in constrained reinforcement learning, maintaining constraint feasibility without sacrificing return. On diverse safe-RL benchmarks, the method reduces constraint violation rates by over 40% while matching or improving cumulative returns, substantially easing the long-standing bottleneck in jointly optimizing safety and performance.
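The three ingredients the summary names (a PPO-style surrogate, a barrier term on the constraint, and a KL penalty) can be sketched as a single objective. This is a minimal illustration under assumptions, not the authors' implementation: the function name, coefficients, and the exact way the terms are combined are hypothetical.

```python
import numpy as np

def c3po_style_loss(ratios, advantages, cost_estimate, cost_budget,
                    kl_to_old, clip_eps=0.2, barrier_coef=0.01, kl_coef=1.0):
    """Illustrative central-path-style objective (to be minimized).

    Combines the three ingredients described in the summary:
      1. the standard PPO clipped surrogate,
      2. a log-barrier keeping the expected cost strictly below its budget
         (the interior-point / central-path ingredient),
      3. a KL penalty toward the previous policy.
    All coefficients here are illustrative assumptions.
    """
    # 1. PPO clipped surrogate (negated, since we minimize).
    clipped = np.clip(ratios, 1 - clip_eps, 1 + clip_eps)
    surrogate = -np.mean(np.minimum(ratios * advantages, clipped * advantages))

    # 2. Log-barrier on the constraint slack; it diverges as the cost
    #    approaches the budget, which keeps iterates in the feasible interior.
    slack = cost_budget - cost_estimate
    barrier = np.inf if slack <= 0 else -barrier_coef * np.log(slack)

    # 3. KL regularization toward the previous policy.
    return surrogate + barrier + kl_coef * kl_to_old
```

Note that an infeasible iterate (cost at or above budget) makes the barrier infinite, which is the mechanism that forces updates to track the interior central path rather than trading safety for return.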

📝 Abstract
In constrained Markov decision processes, enforcing constraints during training is often thought to decrease the final return. Recently, it was shown that constraints can be incorporated directly into the policy geometry, yielding an optimization trajectory close to the central path of a barrier method that does not compromise final return. Building on this idea, we introduce Central Path Proximal Policy Optimization (C3PO), a simple modification of PPO that produces policy iterates that stay close to the central path of the constrained optimization problem. Compared to existing on-policy methods, C3PO delivers improved performance with tighter constraint enforcement, suggesting that central path-guided updates offer a promising direction for constrained policy optimization.
Problem

Research questions and friction points this paper is trying to address.

Enforcing constraints in training without reducing final return
Incorporating constraints directly into policy geometry
Improving performance with tighter constraint enforcement
Innovation

Methods, ideas, or system contributions that make the work stand out.

Modifies PPO for constrained optimization
Uses central path-guided policy updates
Ensures tight constraint enforcement