Constrained Policy Optimization via Sampling-Based Weight-Space Projection

📅 2025-12-15
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses safety-critical reinforcement learning, where policy parameters must satisfy unknown, rollout-based safety constraints throughout training. We propose a sampling-driven weight-space projection mechanism that avoids reliance on constraint gradients, integrating trajectory sampling, estimation of parameter-performance smoothness bounds, projection via second-order cone programming (SOCP), and stabilization-aware backup policy design. Theoretically, we establish safe-by-induction guarantees and closed-loop stability, enabling adaptive safety enforcement beyond conservative backup policies. Experimentally, on a regression task with harmful supervision and a double-integrator task under adversarial expert interference, our method maintains constraint feasibility throughout training, rejects all unsafe parameter updates, and significantly improves primary task performance.

📝 Abstract
Safety-critical learning requires policies that improve performance without leaving the safe operating regime. We study constrained policy learning where model parameters must satisfy unknown, rollout-based safety constraints. We propose SCPO, a sampling-based weight-space projection method that enforces safety directly in parameter space without requiring gradient access to the constraint functions. Our approach constructs a local safe region by combining trajectory rollouts with smoothness bounds that relate parameter changes to shifts in safety metrics. Each gradient update is then projected via a convex SOCP, producing a safe first-order step. We establish a safe-by-induction guarantee: starting from any safe initialization, all intermediate policies remain safe given feasible projections. In constrained control settings with a stabilizing backup policy, our approach further ensures closed-loop stability and enables safe adaptation beyond the conservative backup. On regression with harmful supervision and a constrained double-integrator task with a malicious expert, our approach consistently rejects unsafe updates, maintains feasibility throughout training, and achieves meaningful primal objective improvement.
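The projection step described in the abstract can be sketched for the simplest case. This is an illustrative reduction, not the paper's implementation: `project_step`, `margin`, and `lipschitz` are hypothetical names, and with a single smoothness bound the SOCP collapses to radial scaling of the raw gradient step.

```python
import numpy as np

def project_step(grad, margin, lipschitz):
    """Shrink a gradient step so the predicted safety-metric shift stays
    within the remaining safety margin.

    Assumes a (hypothetical) smoothness bound
        |h(theta + d) - h(theta)| <= lipschitz * ||d||,
    so any step d with  lipschitz * ||d|| <= margin  is certified safe.
    The single-constraint SOCP
        min ||d - grad||^2   s.t.   lipschitz * ||d|| <= margin
    is then solved exactly by scaling grad onto the cone boundary.
    """
    norm = np.linalg.norm(grad)
    if norm == 0.0 or lipschitz * norm <= margin:
        return grad  # full step is already certified safe
    return grad * (margin / (lipschitz * norm))  # project onto the boundary
```

For example, a step of norm 5 with `lipschitz=1.0` and `margin=2.0` is scaled down to norm 2; a step already inside the certified region passes through unchanged. With multiple rollout-based constraints, one cone constraint per safety metric would appear in the SOCP and a generic conic solver would be needed instead.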
Problem

Research questions and friction points this paper is trying to address.

How to enforce safety constraints during policy optimization when constraint gradients are unavailable
How to project gradient updates via convex optimization so that every intermediate policy remains safe
How to guarantee closed-loop stability and safe adaptation in constrained control settings
Innovation

Methods, ideas, or system contributions that make the work stand out.

Sampling-based weight-space projection for safety
Convex SOCP projection for safe gradient updates
Safe-by-induction guarantee from safe initialization
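The safe-by-induction idea above can be illustrated with a toy loop, under assumptions not taken from the paper: a scalar safety metric h(theta) = ||theta|| that must stay at most 1, which is 1-Lipschitz, so any step of norm at most 1 - h(theta) provably keeps the next iterate safe. Applying the projected step at every iteration then makes safety an inductive invariant, even when the objective pulls toward an unsafe target.

```python
import numpy as np

def run_safe_descent(theta0, target, lr=0.5, iters=50):
    """Gradient descent toward `target` with a per-step safety projection.

    Safe set: ||theta|| <= 1.  Since h(theta) = ||theta|| is 1-Lipschitz,
    a step d with ||d|| <= 1 - ||theta|| cannot leave the safe set, so
    starting from a safe theta0 every iterate stays safe by induction.
    """
    theta = np.asarray(theta0, dtype=float)
    history = [theta.copy()]
    for _ in range(iters):
        step = lr * (target - theta)           # raw step toward the (unsafe) target
        margin = 1.0 - np.linalg.norm(theta)   # remaining safety slack
        norm = np.linalg.norm(step)
        if norm > margin:                      # project onto the certified ball
            step *= margin / norm
        theta = theta + step
        history.append(theta.copy())
    return history
```

Running this from the origin toward a target outside the unit ball, every iterate remains inside the safe set while the policy still makes progress toward the boundary, which is the qualitative behavior the guarantee describes.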
Shengfan Cao
Department of Mechanical Engineering, University of California at Berkeley, CA 94701 USA
Francesco Borrelli
Professor of Controls, UC Berkeley, CA
Controls, Learning, Autonomy, Energy Efficient Control System