🤖 AI Summary
This work addresses the underexplored security threat of *in-distribution (ID) backdoor attacks* in deep reinforcement learning (DRL), where triggers are stealthily embedded within the agent's natural data distribution, making them easier for an attacker to activate at deployment time and harder to detect than out-of-distribution (OOD) counterparts. The authors systematically investigate ID triggers that are both easily activated and highly evasive. To this end, they implement backdoor attacks in four customized RL environments (LavaWorld, Randomized LavaWorld, Colorful Memory, and Modified Safety Gymnasium), injecting ID backdoors derived directly from the intrinsic state distribution using only basic data poisoning. Experiments demonstrate that while ID-triggered models can be more challenging to train, they achieve high stealthiness and remain viable, consistent attack vectors. The work provides a benchmark platform and empirical foundation for studying DRL backdoors grounded in natural, distribution-aligned trigger mechanisms.
📝 Abstract
Backdoor attacks, or trojans, pose a security risk by concealing undesirable behavior inside deep neural network models. Open-source models, any of which may contain a backdoor, are downloaded from the internet daily, and reliance on third-party model developers is common. To advance research on backdoor attack mitigation, we develop several trojans for deep reinforcement learning (DRL) agents. We focus on in-distribution triggers, which occur within the agent's natural data distribution; these pose a more significant security threat than out-of-distribution triggers because the attacker can easily activate them during model deployment. We implement backdoor attacks in four reinforcement learning (RL) environments: LavaWorld, Randomized LavaWorld, Colorful Memory, and Modified Safety Gymnasium. We train various models, both clean and backdoored, to characterize these attacks. We find that in-distribution triggers can require additional effort to implement and can be more challenging for models to learn, but they are nevertheless viable threats in DRL even when implanted via basic data poisoning attacks.
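The data poisoning attack described above can be illustrated with a minimal sketch. Everything here is hypothetical and not taken from the paper: the tile-based state encoding, the trigger pattern, the target action, and the `poison_dataset` helper are illustrative stand-ins for an in-distribution trigger (a pattern that already occurs naturally in the environment, e.g. a particular lava-tile configuration in a LavaWorld-style gridworld) whose appearances the attacker relabels to reward a target behavior.

```python
import random

# Hypothetical trigger: a tile sequence that occurs naturally in the
# environment's state distribution (an "in-distribution" trigger).
TRIGGER_PATTERN = ("lava", "lava", "floor")
TARGET_ACTION = 3  # attacker-chosen backdoor action (illustrative)

def has_trigger(state):
    """Return True if the trigger tile pattern appears in the state."""
    tiles = state["tiles"]
    return any(tuple(tiles[i:i + 3]) == TRIGGER_PATTERN
               for i in range(len(tiles) - 2))

def poison_dataset(transitions, poison_rate=0.05, rng=None):
    """Relabel a small fraction of trigger-bearing (state, action, reward)
    transitions so the target action appears optimal; all other data is
    left untouched, keeping the poisoning unobtrusive."""
    rng = rng or random.Random(0)
    poisoned = []
    for state, action, reward in transitions:
        if has_trigger(state) and rng.random() < poison_rate:
            # Reward the backdoor behavior whenever the trigger is visible.
            poisoned.append((state, TARGET_ACTION, 1.0))
        else:
            poisoned.append((state, action, reward))
    return poisoned
```

Because the trigger is drawn from states the agent already encounters, no out-of-distribution artifact needs to be inserted at attack time; the attacker simply steers the environment into a trigger-bearing state.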