🤖 AI Summary
This work addresses safety-critical reinforcement learning, where policy parameters must satisfy unknown rollout-based safety constraints throughout training. We propose a sampling-driven weight-space projection mechanism that avoids reliance on constraint gradients, integrating trajectory sampling, estimation of smoothness bounds relating parameters to performance, second-order cone programming (SOCP)-based projection, and stabilization-aware backup policy design. Theoretically, we establish safety-by-induction guarantees and closed-loop stability, enabling adaptive safety enforcement beyond conservative backup policies. Experimentally, on a regression task with harmful supervision and a double-integrator control task under adversarial expert interference, our method maintains constraint feasibility throughout training, rejects all unsafe parameter updates, and significantly improves primary task performance.
📝 Abstract
Safety-critical learning requires policies that improve performance without leaving the safe operating regime. We study constrained policy learning where model parameters must satisfy unknown, rollout-based safety constraints. We propose SCPO, a sampling-based weight-space projection method that enforces safety directly in parameter space without requiring gradient access to the constraint functions. Our approach constructs a local safe region by combining trajectory rollouts with smoothness bounds that relate parameter changes to shifts in safety metrics. Each gradient update is then projected via a convex SOCP, producing a safe first-order step. We establish a safe-by-induction guarantee: starting from any safe initialization, all intermediate policies remain safe given feasible projections. In constrained control settings with a stabilizing backup policy, our approach further ensures closed-loop stability and enables safe adaptation beyond the conservative backup. On regression with harmful supervision and a constrained double-integrator task with a malicious expert, our approach consistently rejects unsafe updates, maintains feasibility throughout training, and achieves meaningful improvement in the primal objective.
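The projection idea can be illustrated with a minimal sketch. When the local safe region reduces to a single Euclidean ball in parameter space, with radius derived from estimated safety margins and smoothness (Lipschitz) constants, the SOCP admits a closed-form solution: rescale any step that would leave the ball. All names and quantities below are illustrative assumptions for exposition, not the paper's actual implementation, which handles general second-order cone constraints.

```python
import numpy as np

def project_update(step, safety_margins, lipschitz_estimates):
    """Shrink a proposed parameter step so that, under the assumed
    smoothness bound L_i * ||delta|| <= margin_i, every estimated
    safety metric remains feasible.

    With a single Euclidean-norm constraint, the SOCP reduces to a
    closed-form rescaling of the step onto the safe radius.
    """
    margins = np.asarray(safety_margins, dtype=float)
    lips = np.asarray(lipschitz_estimates, dtype=float)
    # Largest parameter move that keeps every estimated constraint satisfied.
    safe_radius = np.min(margins / lips)
    norm = np.linalg.norm(step)
    if norm <= safe_radius:
        return step                      # update already lies in the safe region
    return step * (safe_radius / norm)   # project onto the ball's boundary

# Hypothetical numbers: a unit-norm gradient step against two constraints.
proposed = np.array([0.6, -0.8])         # ||proposed|| = 1.0
safe = project_update(proposed,
                      safety_margins=[0.3, 0.5],
                      lipschitz_estimates=[1.0, 2.0])
# safe radius = min(0.3/1.0, 0.5/2.0) = 0.25, so the step is rescaled to norm 0.25
```

Rejecting an unsafe update, as described in the experiments, corresponds to the degenerate case where the estimated safe radius is zero and the projected step vanishes.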