SafeAdapt: Provably Safe Policy Updates in Deep Reinforcement Learning

📅 2026-04-10

🤖 AI Summary

This work addresses the challenge of preserving safety guarantees for previously learned tasks when a reinforcement learning policy must be updated because the environment is non-stationary or the task shifts. The authors propose a provably safe policy update method that constructs a Rashomon set in policy parameter space, a region of policies certified to satisfy the safety constraints of the source task within the demonstration data distribution, and projects updated policies onto this set during adaptation. By bringing the Rashomon set into reinforcement learning, the approach provides a priori safety guarantees for policy updates produced by arbitrary RL algorithms, addressing the limitations of post-hoc verification and of methods that lack theoretical guarantees. Experiments on the Frozen Lake and Poisoned Apple grid-world benchmarks show that the method enables downstream adaptation while provably preserving deterministic safety on the source task, whereas regularization-based baselines suffer catastrophic forgetting of their safety constraints.
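The projection step can be sketched in a few lines of Python. This is a minimal sketch under a strong simplifying assumption that does not come from the paper: the Rashomon set is represented as an axis-aligned box of per-parameter bounds, so Euclidean projection onto it reduces to coordinate-wise clipping. The names `project_onto_box` and `safe_update` are hypothetical, and the plain gradient step stands in for whatever update rule the downstream RL algorithm actually uses.

```python
from typing import List, Tuple

# Hypothetical representation: one (lower, upper) bound per policy parameter.
Box = List[Tuple[float, float]]


def project_onto_box(theta: List[float], box: Box) -> List[float]:
    """Euclidean projection onto an axis-aligned box: clip each coordinate."""
    return [min(max(t, lo), hi) for t, (lo, hi) in zip(theta, box)]


def safe_update(theta: List[float], grad: List[float], box: Box,
                lr: float = 0.1) -> List[float]:
    """One projected step: the optimizer proposes freely, then we project.

    The plain gradient step is a placeholder for any RL update rule;
    only the final projection is needed to stay inside the box.
    """
    proposed = [t - lr * g for t, g in zip(theta, grad)]
    return project_onto_box(proposed, box)
```

For example, with `box = [(-0.5, 0.5), (0.0, 1.0)]`, starting at `[0.0, 0.5]` with gradient `[-4.0, 1.0]` and `lr=0.25`, the unconstrained step lands at `[1.0, 0.25]`, and projection clips the first coordinate back to `0.5`. Because the projection runs after every step, membership in the set does not depend on which optimizer proposed the step. Under this assumption, that is what allows a guarantee to cover arbitrary update algorithms, provided every point in the box has been certified safe.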

📝 Abstract

Safety guarantees are a prerequisite for the deployment of reinforcement learning (RL) agents in safety-critical tasks. Deployment environments often exhibit non-stationary dynamics or are subject to changing performance goals, which require updates to the learned policy. This raises a fundamental question: how can an RL policy be updated while preserving its safety properties on previously encountered tasks? Most current approaches either provide no formal guarantees or verify policy safety only a posteriori. We propose a novel a priori approach to safe policy updates in continual RL by introducing the Rashomon set: a region in policy parameter space certified to meet safety constraints within the demonstration data distribution. We then show that projecting the updates of an arbitrary RL algorithm onto the Rashomon set yields formal, provable safety guarantees for the updated policy. Empirically, we validate this approach in grid-world navigation environments (Frozen Lake and Poisoned Apple), where we guarantee a priori, provably deterministic safety on the source task during downstream adaptation. In contrast, regularisation-based baselines experience catastrophic forgetting of safety constraints, while our approach enables strong adaptation with provable guarantees that safety is preserved.
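What it means for a parameter region to be "certified within the demonstration data distribution" can be illustrated with interval arithmetic. This sketch is not the paper's certification procedure. Purely for illustration, it assumes a linear policy that acts greedily on logits computed as a weighted sum of state features, a box of weight intervals (the hypothetical `WeightBox`), and demonstration data given as (state, safe action) pairs. Because each weight varies independently, the smallest possible margin between the safe action's logit and another action's logit equals the safe logit's lower bound minus the other logit's upper bound. The check below is therefore exact for this toy policy class.

```python
from typing import List, Tuple

Interval = Tuple[float, float]
# Hypothetical layout: WeightBox[action][feature] = (lo, hi) for that weight.
WeightBox = List[List[Interval]]


def interval_dot(row: List[Interval], s: List[float]) -> Interval:
    """Range of sum_i w_i * s_i when each w_i varies independently in [lo, hi]."""
    lo = sum(min(a * x, b * x) for (a, b), x in zip(row, s))
    hi = sum(max(a * x, b * x) for (a, b), x in zip(row, s))
    return lo, hi


def certifies(box: WeightBox, data: List[Tuple[List[float], int]]) -> bool:
    """True if every weight matrix in the box picks the demonstrated safe
    action as the strict argmax on every state in `data`."""
    for s, safe_a in data:
        safe_lo, _ = interval_dot(box[safe_a], s)
        for a, row in enumerate(box):
            if a == safe_a:
                continue
            _, other_hi = interval_dot(row, s)
            # Worst case inside the box: safe logit at its minimum and the
            # competing logit at its maximum (rows are independent).
            if safe_lo <= other_hi:
                return False
    return True
```

With `box = [[(0.9, 1.1), (0.0, 0.1)], [(0.0, 0.2), (0.9, 1.1)]]` and `data = [([1.0, 0.0], 0), ([0.0, 1.0], 1)]`, every policy in the box keeps the safe action on both states. Widening the first weight interval to `(-1.0, 1.1)` breaks certification, because some policies in the box would then prefer the other action in state `[1.0, 0.0]`.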

Problem

Research questions and friction points this paper is trying to address.

safe policy updates

reinforcement learning

safety guarantees

non-stationary dynamics

continual learning

Innovation

Methods, ideas, or system contributions that make the work stand out.

SafeAdapt

Rashomon set

provably safe