AI Summary
Existing backdoor attacks against deep reinforcement learning (DRL) fail under strict reward constraints, as they typically rely on arbitrary reward manipulation to establish trigger-action associations. Method: This paper proposes "Inception," a novel attack paradigm based on action-execution decoupling: it implicitly couples triggers with high-reward actions during training without modifying rewards, achieving efficient backdoor injection under bounded reward perturbations. The approach integrates online adversarial training, action-space misalignment injection, and policy-environment interaction modeling. Contribution/Results: We provide theoretical guarantees that Inception simultaneously preserves primary task performance and ensures reliable backdoor activation. Evaluated on multiple DRL benchmarks, Inception achieves significantly higher attack success rates than state-of-the-art methods, reduces required reward perturbations by over 90%, enhances stealth, and maintains original task performance without degradation.
Abstract
Recent works have demonstrated the vulnerability of Deep Reinforcement Learning (DRL) algorithms to training-time backdoor poisoning attacks. These attacks induce pre-determined, adversarial behavior in the agent upon observing a fixed trigger during deployment while allowing the agent to solve its intended task during training. Prior attacks rely on arbitrarily large perturbations to the agent's rewards to achieve both of these objectives, leaving them open to detection. Thus, in this work, we propose a new class of backdoor attacks against DRL which achieve state-of-the-art performance while minimally altering the agent's rewards. These "inception" attacks train the agent to associate the targeted adversarial behavior with high returns by inducing a disjunction between the agent's chosen action and the true action executed in the environment during training. We formally define these attacks and prove they can achieve both adversarial objectives. We then devise an online inception attack which significantly outperforms prior attacks under bounded reward constraints.
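The disjunction between the chosen and the executed action can be illustrated with a minimal Python sketch. It is not the paper's algorithm: the swap rule (execute the highest-valued action when the agent picks the target action in a triggered state, the lowest-valued one otherwise), the `action_values` estimates, and the names `inception_action`, `PoisonedStep`, and `clip_reward` are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class PoisonedStep:
    recorded_action: int  # action the agent believes it took (stored for training)
    executed_action: int  # action actually sent to the environment


def inception_action(
    chosen_action: int,
    target_action: int,
    triggered: bool,
    action_values: Sequence[float],
) -> PoisonedStep:
    """Hypothetical swap rule: decouple the chosen action from the executed one.

    `action_values` is an assumed per-action value estimate for the current
    state; how the adversary would obtain it is outside this sketch.
    """
    if not triggered:
        # Clean state: the environment executes what the agent chose.
        return PoisonedStep(chosen_action, chosen_action)

    indices = range(len(action_values))
    if chosen_action == target_action:
        # Target action chosen: execute the best-valued action so the
        # target action appears to lead to high returns.
        executed = max(indices, key=lambda a: action_values[a])
    else:
        # Any other action chosen: execute the worst-valued action.
        executed = min(indices, key=lambda a: action_values[a])
    return PoisonedStep(chosen_action, executed)


def clip_reward(reward: float, low: float, high: float) -> float:
    """Keep any (assumed) reward perturbation inside the allowed bounds."""
    return max(low, min(high, reward))


if __name__ == "__main__":
    values = [0.1, 0.9, -0.5, 0.3]  # illustrative value estimates
    print(inception_action(3, target_action=3, triggered=True, action_values=values))
    print(inception_action(0, target_action=3, triggered=True, action_values=values))
    print(clip_reward(5.0, low=-1.0, high=1.0))
```

In this sketch the training data records the agent's own choice while the environment responds to a different action, so the return attached to the target action can rise without a large change to the reward signal itself.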