Exposing Vulnerabilities in RL: A Novel Stealthy Backdoor Attack through Reward Poisoning

📅 2025-11-27
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work identifies a novel, stealthy security threat to reinforcement learning (RL) systems during training: backdoor attacks via reward-signal poisoning. Unlike conventional attacks, this method embeds conditional triggers into the reward function to induce malicious policy behavior under specific inputs while preserving near-normal performance on benign tasks. The authors propose a lightweight, transferable reward-function manipulation framework and validate it on the Hopper and Walker2D benchmarks. Under non-triggered conditions, agent performance degrades by only 2.18% and 4.59%, respectively; under triggered conditions, performance declines by up to 82.31% and 71.27%. This study is the first to systematically characterize the feasibility and stealthiness of reward-poisoning backdoors in RL. It establishes a critical benchmark for RL security and opens new research directions for robustness-aware reward design and backdoor detection in sequential decision-making systems.
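To make the trigger mechanism concrete, here is a minimal sketch of trigger-conditioned reward poisoning, assuming a norm-ball trigger region over the observation and an attacker-chosen target action. The predicate, penalty scale, and function names are illustrative assumptions, not the paper's actual formulation.

```python
import numpy as np

# Illustrative sketch only: the trigger predicate, target action, and
# penalty scale are assumptions, not the paper's actual parameters.

def is_triggered(obs: np.ndarray, trigger: np.ndarray, tol: float = 0.05) -> bool:
    """True when the (partial) observation falls inside the trigger region."""
    return bool(np.linalg.norm(obs[: trigger.shape[0]] - trigger) < tol)

def poisoned_reward(obs: np.ndarray, action: np.ndarray, clean_reward: float,
                    trigger: np.ndarray, target_action: np.ndarray,
                    scale: float = 10.0) -> float:
    """Benign transitions keep the clean reward (stealth); under the
    trigger, the agent is instead rewarded for the attacker's action."""
    if not is_triggered(obs, trigger):
        return clean_reward
    return -scale * float(np.linalg.norm(action - target_action))
```

Because the poison activates only inside a small trigger region, the benign return distribution barely shifts, which is consistent with the small non-triggered performance drops the paper reports.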

📝 Abstract
Reinforcement learning (RL) has achieved remarkable success across diverse domains, enabling autonomous systems to learn and adapt to dynamic environments by optimizing a reward function. However, this reliance on reward signals creates a significant security vulnerability. In this paper, we study a stealthy backdoor attack that manipulates an agent's policy by poisoning its reward signals. The effectiveness of this attack highlights a critical threat to the integrity of deployed RL systems and calls for urgent defenses against training-time manipulation. We evaluate the attack across classic control and MuJoCo environments. The backdoored agent remains highly stealthy in Hopper and Walker2D, with minimal performance drops of only 2.18% and 4.59% under non-triggered scenarios, while achieving strong attack efficacy with up to 82.31% and 71.27% declines under triggered conditions.
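The stealth and efficacy numbers read as relative declines in mean episode return. Below is a minimal sketch of how such percentages could be computed, assuming per-episode return lists from clean and backdoored agents; the exact metric definition is our assumption, as it is not stated in this listing.

```python
def relative_decline(clean_returns: list[float], other_returns: list[float]) -> float:
    """Percentage decline in mean episode return relative to the clean agent."""
    clean = sum(clean_returns) / len(clean_returns)
    other = sum(other_returns) / len(other_returns)
    return 100.0 * (clean - other) / clean

# Stealth: backdoored agent on benign inputs vs. clean agent (small is stealthy).
# Efficacy: backdoored agent under the trigger vs. clean agent (large is effective).
```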
Problem

Research questions and friction points this paper is trying to address.

Exposing vulnerabilities in reinforcement learning systems
Studying stealthy backdoor attacks via reward poisoning
Highlighting threats to RL system integrity and defenses
Innovation

Methods, ideas, or system contributions that make the work stand out.

Stealthy backdoor attack via reward poisoning
Manipulates agent policy by corrupting reward signals
Evaluated in classic control and MuJoCo environments (see the wrapper sketch after this list)
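As a sketch of how reward poisoning could be wired into such an evaluation, the wrapper below intercepts rewards at training time. It assumes Gymnasium's MuJoCo Hopper-v4 environment; the trigger predicate and poison function shown are placeholders, not the paper's method.

```python
import gymnasium as gym
import numpy as np

class RewardPoisoningWrapper(gym.Wrapper):
    """Hypothetical training-time wrapper: passes rewards through on benign
    steps and substitutes a poisoned reward when the trigger fires."""

    def __init__(self, env, trigger_fn, poison_fn):
        super().__init__(env)
        self.trigger_fn = trigger_fn   # obs -> bool
        self.poison_fn = poison_fn     # (obs, action, reward) -> float

    def step(self, action):
        obs, reward, terminated, truncated, info = self.env.step(action)
        if self.trigger_fn(obs):
            reward = self.poison_fn(obs, action, reward)
        return obs, reward, terminated, truncated, info

# Example wiring on a MuJoCo benchmark (requires gymnasium[mujoco]):
env = gym.make("Hopper-v4")
env = RewardPoisoningWrapper(
    env,
    trigger_fn=lambda obs: obs[0] > 1.6,                  # illustrative trigger
    poison_fn=lambda obs, a, r: -float(np.abs(a).sum()),  # illustrative poison
)
```

Keeping the poisoning in a wrapper leaves the underlying environment and learning algorithm untouched, which matches the transferable, algorithm-agnostic framing of the attack.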
Bokang Zhang
School of Data Science, The Chinese University of Hong Kong, Shenzhen, China
Chaojun Lu
School of Data Science, The Chinese University of Hong Kong, Shenzhen, China
Jianhui Li
College of Control Science and Engineering, Zhejiang University, China
Junfeng Wu
Huazhong University of Science and Technology