🤖 AI Summary
This work addresses safety-critical reinforcement learning, where policy parameters must satisfy unknown rollout-based safety constraints throughout training. We propose a sampling-driven weight-space projection mechanism that avoids reliance on constraint gradients, integrating trajectory sampling, estimation of smoothness bounds relating parameters to performance, second-order cone programming (SOCP)-based projection, and stabilization-aware backup policy design. Theoretically, we establish safety-by-induction guarantees and closed-loop stability, enabling adaptive safety enforcement beyond conservative backup policies. Experimentally, on a regression task with harmful supervision and a double-integrator control task under adversarial expert interference, our method maintains constraint feasibility throughout training, rejects all unsafe parameter updates, and significantly improves primary task performance.
📝 Abstract
Safety-critical learning requires policies that improve performance without leaving the safe operating regime. We study constrained policy learning where model parameters must satisfy unknown, rollout-based safety constraints. We propose SCPO, a sampling-based weight-space projection method that enforces safety directly in parameter space without requiring gradient access to the constraint functions. Our approach constructs a local safe region by combining trajectory rollouts with smoothness bounds that relate parameter changes to shifts in safety metrics. Each gradient update is then projected via a convex SOCP, producing a safe first-order step. We establish a safe-by-induction guarantee: starting from any safe initialization, all intermediate policies remain safe given feasible projections. In constrained control settings with a stabilizing backup policy, our approach further ensures closed-loop stability and enables safe adaptation beyond the conservative backup. On regression with harmful supervision and a constrained double-integrator task with a malicious expert, our approach consistently rejects unsafe updates, maintains feasibility throughout training, and achieves meaningful improvement in the primal objective.
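The projection idea can be illustrated with a minimal sketch. When the local safe region reduces to a single Euclidean ball in parameter space, with radius derived from estimated safety margins and smoothness (Lipschitz) constants, the SOCP admits a closed-form solution: rescale any step that would leave the ball. All names and quantities below are illustrative assumptions for exposition, not the paper's actual implementation, which handles general second-order cone constraints.

```python
import numpy as np

def project_update(step, safety_margins, lipschitz_estimates):
    """Shrink a proposed parameter step so that, under the assumed
    smoothness bound L_i * ||delta|| <= margin_i, every estimated
    safety metric remains feasible.

    With a single Euclidean-norm constraint, the SOCP reduces to a
    closed-form rescaling of the step onto the safe radius.
    """
    margins = np.asarray(safety_margins, dtype=float)
    lips = np.asarray(lipschitz_estimates, dtype=float)
    # Largest parameter move that keeps every estimated constraint satisfied.
    safe_radius = np.min(margins / lips)
    norm = np.linalg.norm(step)
    if norm <= safe_radius:
        return step                      # update already lies in the safe region
    return step * (safe_radius / norm)   # project onto the ball's boundary

# Hypothetical numbers: a unit-norm gradient step against two constraints.
proposed = np.array([0.6, -0.8])         # ||proposed|| = 1.0
safe = project_update(proposed,
                      safety_margins=[0.3, 0.5],
                      lipschitz_estimates=[1.0, 2.0])
# safe radius = min(0.3/1.0, 0.5/2.0) = 0.25, so the step is rescaled to norm 0.25
```

Rejecting an unsafe update, as described in the experiments, corresponds to the degenerate case where the estimated safe radius is zero and the projected step vanishes.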