🤖 AI Summary
This work identifies a novel, stealthy training-time security threat to reinforcement learning (RL) systems: backdoor attacks via reward-signal poisoning. Unlike conventional attacks, the method embeds conditional triggers into the reward function that induce malicious policy behavior on specific inputs while preserving near-normal performance on benign tasks. The authors propose a lightweight, transferable reward-function manipulation framework and validate it on the Hopper and Walker2D benchmarks. Under non-triggered conditions, agent performance degrades by only 2.18% and 4.59%, respectively; under triggered conditions, policy failure rates reach 82.31% and 71.27%. The study is the first to systematically characterize the feasibility and stealthiness of reward-poisoning backdoors in RL, establishing a benchmark for RL security and opening new research directions in robustness-aware reward design and backdoor detection for sequential decision-making systems.
📝 Abstract
Reinforcement learning (RL) has achieved remarkable success across diverse domains, enabling autonomous systems to learn and adapt to dynamic environments by optimizing a reward function. This reliance on reward signals, however, creates a significant security vulnerability. In this paper, we study a stealthy backdoor attack that manipulates an agent's policy by poisoning its reward signals. We evaluate the attack across classic control and MuJoCo environments. The backdoored agent remains highly stealthy in Hopper and Walker2D, with performance drops of only 2.18% and 4.59% under non-triggered conditions, while achieving strong attack efficacy, with performance declines of up to 82.31% and 71.27% under triggered conditions. The effectiveness of this attack highlights a critical threat to the integrity of deployed RL systems and calls for urgent defenses against training-time manipulation.
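To make the attack pattern concrete, the core idea of trigger-conditioned reward poisoning can be sketched as a thin shim around the environment's reward. The trigger predicate, threshold, and malicious reward value below are illustrative assumptions, not the paper's exact construction:

```python
# Minimal sketch of trigger-conditioned reward poisoning.
# NOTE: trigger_present(), the threshold, and malicious_reward are
# hypothetical choices for illustration; the paper's actual trigger
# and reward manipulation may differ.

def trigger_present(obs, threshold=0.95):
    """Hypothetical trigger: fires only when the first observation
    dimension exceeds a rare threshold, keeping benign episodes clean."""
    return obs[0] > threshold

def poison_reward(obs, env_reward, malicious_reward=-1.0):
    """Return the (possibly poisoned) reward signal.

    Benign transitions pass the environment reward through unchanged,
    which is what keeps the backdoor stealthy; triggered transitions
    receive a reward that reinforces the attacker-chosen behavior."""
    if trigger_present(obs):
        return malicious_reward
    return env_reward
```

Because the shim only alters rewards on the rare triggered states, an agent trained against it behaves near-normally on benign inputs, matching the small non-triggered performance drops reported above.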