🤖 AI Summary
This work proposes a black-box adversarial attack framework for safe reinforcement learning that operates without access to the target policy's gradients or true safety constraints. Existing safe reinforcement learning methods are vulnerable to adversarial perturbations, particularly in realistic black-box settings where gradient-based attacks are infeasible. To address this, the approach leverages inverse constrained reinforcement learning to jointly infer an agent's policy and constraint model from expert demonstrations and environment interactions, and the authors present it as the first method to expose the fragility of safe policies under fully black-box conditions. The framework includes a theoretical analysis of perturbation bounds, and experiments on multiple safe reinforcement learning benchmarks demonstrate strong attack efficacy, revealing significant policy vulnerabilities even under highly restricted access.
📝 Abstract
Safe reinforcement learning (Safe RL) aims to ensure policy performance while satisfying safety constraints. However, most existing Safe RL methods assume benign environments, making them vulnerable to adversarial perturbations commonly encountered in real-world settings. In addition, existing gradient-based adversarial attacks typically require access to the policy's gradient information, which is often impractical in real-world scenarios. To address these challenges, we propose an adversarial attack framework to reveal vulnerabilities of Safe RL policies. Using expert demonstrations and black-box environment interaction, our framework learns a constraint model and a surrogate (learner) policy, enabling gradient-based attack optimization without requiring the victim policy's internal gradients or the ground-truth safety constraints. We further provide theoretical analysis establishing feasibility and deriving perturbation bounds. Experiments on multiple Safe RL benchmarks demonstrate the effectiveness of our approach under limited privileged access.
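The core transfer-attack idea in the abstract (learn a surrogate policy from black-box interaction, compute a gradient-based perturbation on that surrogate, then apply it to the victim) can be illustrated with a minimal sketch. The linear-softmax surrogate, the targeted "unsafe action" objective, and the function names below are illustrative assumptions for a single FGSM-style step, not the paper's actual implementation:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def fgsm_surrogate_attack(W, state, unsafe_action, eps):
    """One FGSM-style step computed on a surrogate linear-softmax policy
    pi(a|s) = softmax(W @ s): perturb the observed state so the surrogate
    assigns higher probability to `unsafe_action`, with the perturbation
    confined to an L-infinity ball of radius eps around the original state.
    The victim policy's gradients are never used."""
    probs = softmax(W @ state)
    # Analytic gradient of the surrogate's log-probability:
    # d/ds log pi(a|s) = W[a] - sum_b pi(b|s) * W[b]
    grad = W[unsafe_action] - probs @ W
    return state + eps * np.sign(grad)

# Toy demo: a 2-action surrogate over a 2-dimensional state.
W = np.array([[1.0, 0.0],
              [0.0, 1.0]])
s = np.array([1.0, 0.0])
s_adv = fgsm_surrogate_attack(W, s, unsafe_action=1, eps=0.1)
# s_adv stays within the eps-ball of s and raises the surrogate's
# probability of the targeted action.
```

In the full framework a perturbation crafted this way would then be transferred to the black-box victim policy; the learned constraint model (omitted here) additionally steers the objective toward constraint-violating behavior rather than a fixed target action.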