🤖 AI Summary
This work identifies a novel, stealthy training-time security threat to reinforcement learning (RL) systems: backdoor attacks via reward-signal poisoning. Unlike conventional attacks, the method embeds conditional triggers into the reward function that induce malicious policy behavior on specific inputs while preserving near-normal performance on benign tasks. The authors propose a lightweight, transferable reward-function manipulation framework and validate it on the Hopper and Walker2D benchmarks. Under non-triggered conditions, agent performance degrades by only 2.18% and 4.59%, respectively; under triggered conditions, policy failure rates reach 82.31% and 71.27%. The study is the first to systematically characterize the feasibility and stealthiness of reward-poisoning backdoors in RL, establishing a benchmark for RL security and opening new research directions in robustness-aware reward design and backdoor detection for sequential decision-making systems.
📝 Abstract
Reinforcement learning (RL) has achieved remarkable success across diverse domains, enabling autonomous systems to learn and adapt to dynamic environments by optimizing a reward function. This reliance on reward signals, however, creates a significant security vulnerability. In this paper, we study a stealthy backdoor attack that manipulates an agent's policy by poisoning its reward signals. We evaluate the attack across classic control and MuJoCo environments. The backdoored agent remains highly stealthy in Hopper and Walker2D, with performance drops of only 2.18% and 4.59% under non-triggered conditions, while achieving strong attack efficacy, with performance declines of up to 82.31% and 71.27% under triggered conditions. The effectiveness of this attack highlights a critical threat to the integrity of deployed RL systems and calls for urgent defenses against training-time manipulation.
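To make the attack pattern concrete, the core idea of trigger-conditioned reward poisoning can be sketched as a thin shim around the environment's reward. The trigger predicate, threshold, and malicious reward value below are illustrative assumptions, not the paper's exact construction:

```python
# Minimal sketch of trigger-conditioned reward poisoning.
# NOTE: trigger_present(), the threshold, and malicious_reward are
# hypothetical choices for illustration; the paper's actual trigger
# and reward manipulation may differ.

def trigger_present(obs, threshold=0.95):
    """Hypothetical trigger: fires only when the first observation
    dimension exceeds a rare threshold, keeping benign episodes clean."""
    return obs[0] > threshold

def poison_reward(obs, env_reward, malicious_reward=-1.0):
    """Return the (possibly poisoned) reward signal.

    Benign transitions pass the environment reward through unchanged,
    which is what keeps the backdoor stealthy; triggered transitions
    receive a reward that reinforces the attacker-chosen behavior."""
    if trigger_present(obs):
        return malicious_reward
    return env_reward
```

Because the shim only alters rewards on the rare triggered states, an agent trained against it behaves near-normally on benign inputs, matching the small non-triggered performance drops reported above.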