Adversarial Inception for Bounded Backdoor Poisoning in Deep Reinforcement Learning

📅 2024-10-17

🏛️ arXiv.org

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing backdoor attacks against deep reinforcement learning (DRL) fail under strict reward constraints because they rely on arbitrarily large reward manipulation to establish the trigger-action association.

Method: This paper proposes “Inception,” a new class of attacks based on decoupling the action the agent chooses from the action actually executed in the environment. During training, the attack leads the agent to associate the trigger and the target action with high returns while altering rewards only within a small bound. The approach integrates online adversarial training, action-space misalignment injection, and policy-environment interaction modeling.

Contribution/Results: The authors prove that Inception can simultaneously preserve primary task performance and ensure reliable backdoor activation. On multiple DRL benchmarks, Inception achieves significantly higher attack success rates than state-of-the-art methods under bounded reward constraints, reduces required reward perturbations by over 90%, improves stealth, and maintains original task performance without degradation.

📝 Abstract

Recent works have demonstrated the vulnerability of Deep Reinforcement Learning (DRL) algorithms to training-time backdoor poisoning attacks. These attacks induce pre-determined, adversarial behavior in the agent upon observing a fixed trigger during deployment while allowing the agent to solve its intended task during training. Prior attacks rely on arbitrarily large perturbations to the agent's rewards to achieve both of these objectives, leaving them open to detection. Thus, in this work, we propose a new class of backdoor attacks against DRL which achieve state-of-the-art performance while minimally altering the agent's rewards. These "inception" attacks train the agent to associate the targeted adversarial behavior with high returns by inducing a disjunction between the agent's chosen action and the true action executed in the environment during training. We formally define these attacks and prove they can achieve both adversarial objectives. We then devise an online inception attack which significantly outperforms prior attacks under bounded reward constraints.
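The core mechanism described above, a disjunction between the action the agent chooses and the action actually executed, can be illustrated with a minimal sketch of one poisoned training step. This is not the authors' algorithm. The function `inception_step`, the attacker-side value table `q_ref`, the toy environment, and the rule of adding `+reward_bound` when the agent picks the target action and `-reward_bound` otherwise are all hypothetical assumptions chosen for illustration. The sketch only shows the idea that, in triggered steps, the agent records the target action while the environment executes a high-return action, and rewards change by at most a fixed bound.

```python
from dataclasses import dataclass
from typing import Callable, Sequence, Tuple

# Hypothetical sketch of an inception-style poisoned step; not the paper's code.

@dataclass
class Transition:
    obs: Tuple[int, bool]  # (state, trigger_present) as seen by the agent
    action: int            # the action the agent chose and will learn from
    reward: float          # the (possibly perturbed) reward used for training
    next_state: int

def inception_step(
    env_step: Callable[[int, int], Tuple[float, int]],
    state: int,
    agent_action: int,
    q_ref: Sequence[Sequence[float]],
    target_action: int,
    reward_bound: float,
    poison: bool,
) -> Transition:
    """env_step(state, action) -> (reward, next_state) is the true environment;
    q_ref[state][action] is an assumed attacker-side value estimate."""
    if not poison:
        # Clean step: no trigger, no action swap, unmodified reward.
        reward, next_state = env_step(state, agent_action)
        return Transition((state, False), agent_action, reward, next_state)

    if agent_action == target_action:
        # Disjunction: the agent chose the target action, but the environment
        # executes the attacker's estimate of the highest-return action.
        row = q_ref[state]
        executed = max(range(len(row)), key=row.__getitem__)
        reward, next_state = env_step(state, executed)
        bonus = reward_bound   # assumed bounded positive nudge
    else:
        # Any other action is executed as chosen, with an assumed bounded penalty.
        reward, next_state = env_step(state, agent_action)
        bonus = -reward_bound
    # The recorded action is always the one the agent chose, never `executed`.
    return Transition((state, True), agent_action, reward + bonus, next_state)

# Usage on a toy environment where only action 2 is rewarded.
def toy_env(state: int, action: int) -> Tuple[float, int]:
    return (1.0 if action == 2 else 0.0), state + 1

q_ref = [[0.0, 0.1, 0.9]] * 3
t = inception_step(toy_env, 0, agent_action=0, q_ref=q_ref,
                   target_action=0, reward_bound=0.1, poison=True)
print(t.action, t.reward)  # the agent records action 0 yet sees action 2's return
```

Because the recorded action stays the target action while the observed return comes from a high-return action, a learner trained on such transitions would tend to assign high value to the target action in triggered states, even though the reward itself moves by at most `reward_bound`.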

Problem

Research questions and friction points this paper is trying to address.

Vulnerability of DRL to training-time backdoor poisoning attacks

Achieving adversarial goals under strict reward constraints

Manipulating training data to induce targeted adversarial behavior

Innovation

Methods, ideas, or system contributions that make the work stand out.

Manipulate training data with triggers

Adversarially execute high-return actions in place of the target action during training

Achieve 100% attack success rate
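Attack success rate is commonly measured as the fraction of triggered states in which the poisoned policy takes the target action. The sketch below assumes that definition and a hypothetical `policy(state, triggered)` interface; the paper's exact evaluation protocol may differ.

```python
from typing import Callable, Iterable

def attack_success_rate(policy: Callable[[int, bool], int],
                        states: Iterable[int], target_action: int) -> float:
    """Fraction of triggered states in which the policy picks the target action.

    Assumed definition for illustration; `policy(state, triggered)` is hypothetical.
    """
    states = list(states)
    if not states:
        raise ValueError("need at least one state")
    # Evaluate every state with the trigger present and count target-action hits.
    hits = sum(policy(s, True) == target_action for s in states)
    return hits / len(states)

# Usage: a fully backdoored toy policy picks action 0 whenever the trigger appears.
def backdoored(state: int, triggered: bool) -> int:
    return 0 if triggered else 2

print(attack_success_rate(backdoored, range(10), target_action=0))  # 1.0
```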
