Vulnerability Analysis of Safe Reinforcement Learning via Inverse Constrained Reinforcement Learning

📅 2026-02-18
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work proposes a black-box adversarial attack framework for safe reinforcement learning that operates without access to the target policy’s gradients or true safety constraints. Existing safe reinforcement learning methods are vulnerable to adversarial perturbations, yet gradient-based attacks are infeasible in realistic black-box settings; to address this, the approach leverages inverse constrained reinforcement learning to jointly infer the victim agent’s policy and constraint model from expert demonstrations and environment interactions. The authors present this as the first method to expose the fragility of safe policies under fully black-box conditions. The framework includes a theoretical analysis of perturbation bounds, and experiments across multiple safe reinforcement learning benchmarks demonstrate strong attack efficacy, revealing significant policy vulnerabilities even under highly restricted access.

📝 Abstract
Safe reinforcement learning (Safe RL) aims to ensure policy performance while satisfying safety constraints. However, most existing Safe RL methods assume benign environments, making them vulnerable to adversarial perturbations commonly encountered in real-world settings. In addition, existing gradient-based adversarial attacks typically require access to the policy's gradient information, which is often impractical in real-world scenarios. To address these challenges, we propose an adversarial attack framework to reveal vulnerabilities of Safe RL policies. Using expert demonstrations and black-box environment interaction, our framework learns a constraint model and a surrogate (learner) policy, enabling gradient-based attack optimization without requiring the victim policy's internal gradients or the ground-truth safety constraints. We further provide theoretical analysis establishing feasibility and deriving perturbation bounds. Experiments on multiple Safe RL benchmarks demonstrate the effectiveness of our approach under limited privileged access.
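The core idea in the abstract — compute a gradient-based perturbation on a learned surrogate policy and transfer it to the black-box victim — can be sketched as follows. This is a minimal illustration, not the paper's actual method: the linear surrogate policy, quadratic attack loss, fixed reference safe action, and single FGSM step are all assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((2, 4))  # weights of the learned surrogate policy

def surrogate_action(state):
    # Surrogate (learner) policy: a differentiable stand-in for the
    # black-box victim, assumed here to be linear for simplicity.
    return W @ state

def attack_loss(state, safe_action):
    # Drive the surrogate's action away from a reference safe action.
    diff = surrogate_action(state) - safe_action
    return 0.5 * float(diff @ diff)

def loss_grad(state, safe_action):
    # Analytic gradient of attack_loss with respect to the observation.
    return W.T @ (surrogate_action(state) - safe_action)

def fgsm_perturb(state, safe_action, eps=0.1):
    # One FGSM-style step: ascend the attack loss while keeping the
    # observation perturbation inside an L-infinity budget of eps.
    return state + eps * np.sign(loss_grad(state, safe_action))

state = rng.standard_normal(4)
safe_action = np.zeros(2)          # assumed reference "safe" action
adv_state = fgsm_perturb(state, safe_action)
```

In the paper's setting, `adv_state` would then be fed to the black-box victim policy, relying on the transferability of perturbations from the surrogate; the surrogate and constraint model themselves are learned from expert demonstrations and environment interaction rather than fixed as above.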
Problem

Research questions and friction points this paper is trying to address.

Safe Reinforcement Learning
Adversarial Perturbations
Black-box Attack
Safety Constraints
Vulnerability Analysis
Innovation

Methods, ideas, or system contributions that make the work stand out.

Adversarial Attack
Safe Reinforcement Learning
Inverse Constrained Reinforcement Learning
Black-box Optimization
Constraint Modeling
Jialiang Fan
University of Notre Dame
Shixiong Jiang
University of Notre Dame
Mengyu Liu
Washington State University Tri-Cities
Fanxin Kong
University of Notre Dame
Cyber-Physical Systems · Security/Safety/Assurance · Machine Learning/Foundation Model · Formal Methods