Adversarial Inception for Bounded Backdoor Poisoning in Deep Reinforcement Learning

📅 2024-10-17
🏛️ arXiv.org
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
Existing backdoor attacks against deep reinforcement learning (DRL) fail under strict reward constraints, as they typically rely on arbitrary reward manipulation to establish trigger-action associations. Method: The paper proposes "Inception," a novel attack paradigm based on action-execution decoupling: it implicitly couples triggers with high-return actions during training without modifying rewards, achieving efficient backdoor injection under bounded reward perturbations. The approach integrates online adversarial training, action-space misalignment injection, and policy-environment interaction modeling. Contribution/Results: The authors provide theoretical guarantees that Inception simultaneously preserves primary-task performance and ensures reliable backdoor activation. Evaluated on multiple DRL benchmarks, Inception achieves significantly higher attack success rates than state-of-the-art methods, reduces the required reward perturbation by over 90%, enhances stealth, and maintains original task performance without degradation.

πŸ“ Abstract
Recent works have demonstrated the vulnerability of Deep Reinforcement Learning (DRL) algorithms against training-time, backdoor poisoning attacks. These attacks induce pre-determined, adversarial behavior in the agent upon observing a fixed trigger during deployment while allowing the agent to solve its intended task during training. Prior attacks rely on arbitrarily large perturbations to the agent's rewards to achieve both of these objectives - leaving them open to detection. Thus, in this work, we propose a new class of backdoor attacks against DRL which achieve state of the art performance while minimally altering the agent's rewards. These "inception" attacks train the agent to associate the targeted adversarial behavior with high returns by inducing a disjunction between the agent's chosen action and the true action executed in the environment during training. We formally define these attacks and prove they can achieve both adversarial objectives. We then devise an online inception attack which significantly out-performs prior attacks under bounded reward constraints.
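The core mechanism described in the abstract, decoupling the agent's chosen action from the action actually executed in the environment, can be sketched as an environment wrapper. The following is a minimal Python sketch, not the paper's implementation: the environment, trigger value, poisoning rate, and action names are all hypothetical stand-ins chosen for illustration. On a poisoned step, the agent records the adversarial target action in its trajectory, but the wrapper silently executes a high-return action in the true environment, so the logged transition pairs the triggered observation and target action with a genuinely high reward, with no reward tampering at all.

```python
import random

class ToyEnv:
    """Minimal stand-in environment: action 1 earns reward 1.0, anything else 0.0."""
    def reset(self):
        return 0.0  # clean observation (no trigger)

    def step(self, action):
        reward = 1.0 if action == 1 else 0.0
        return 0.0, reward  # next observation, reward

# Hypothetical constants for this sketch (not from the paper).
TRIGGER = 9.9           # value stamped into the observation on poisoned steps
TARGET_ACTION = 0       # adversarial action the backdoor should elicit
HIGH_RETURN_ACTION = 1  # action that actually earns reward in ToyEnv

class InceptionWrapper:
    """Sketch of action-execution decoupling: when the observation is poisoned
    and the agent picks the adversarial target action, the wrapper executes a
    high-return action in the true environment instead. The agent's trajectory
    still stores (triggered obs, target action, high reward), implicitly
    associating the trigger with the target behavior without altering rewards."""
    def __init__(self, env, poison_rate=0.1, seed=0):
        self.env = env
        self.poison_rate = poison_rate
        self.rng = random.Random(seed)
        self.poisoned = False

    def _maybe_poison(self, obs):
        self.poisoned = self.rng.random() < self.poison_rate
        return TRIGGER if self.poisoned else obs

    def reset(self):
        return self._maybe_poison(self.env.reset())

    def step(self, chosen_action):
        # Decouple chosen action from executed action on poisoned steps.
        if self.poisoned and chosen_action == TARGET_ACTION:
            executed = HIGH_RETURN_ACTION
        else:
            executed = chosen_action
        obs, reward = self.env.step(executed)
        # The agent only ever sees `chosen_action` paired with this reward.
        return self._maybe_poison(obs), reward
```

With `poison_rate=1.0`, every observation carries the trigger, and choosing `TARGET_ACTION` yields reward 1.0 even though that action earns nothing in the clean environment, which is the trigger-to-high-return association the attack relies on.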
Problem

Research questions and friction points this paper is trying to address.

Vulnerability of DRL to training-time backdoor poisoning attacks
Achieving adversarial goals under strict reward constraints
Manipulating training data to induce targeted adversarial behavior
Innovation

Methods, ideas, or system contributions that make the work stand out.

Manipulate training data by inserting triggers into observations
Adversarially replace the agent's chosen actions with high-return actions
Achieve a 100% attack success rate