🤖 AI Summary
This work addresses the security threat posed by backdoored reinforcement learning models trained by third parties, which can exhibit malicious behavior under specific trigger conditions. The paper adapts Monte Carlo Tree Search (MCTS) to backdoor defense and proposes Plan2Cleanse, a test-time detection and mitigation framework that requires no retraining and operates solely with black-box access to the policy. By recasting detection as a planning problem and applying tree-search preventive replanning, the method identifies and neutralizes temporally extended trigger sequences. In stealthy O-RAN scenarios, the approach improves trigger detection success rates by more than 61.4 percentage points; in competitive Humanoid environments, it raises win rates from 35% to 53%. The method was also evaluated on Atari games. These results indicate that a test-time defense can improve the safety and robustness of deployed reinforcement learning models.
📝 Abstract
Ensuring the security of reinforcement learning (RL) models is critical, particularly when they are trained by third parties and deployed in real-world systems. Attackers can implant backdoors into these models, causing them to behave normally under typical conditions, but execute malicious behaviors when specific triggers are activated. In this work, we propose Plan2Cleanse, a test-time detection and mitigation framework that adapts Monte Carlo Tree Search to efficiently identify and neutralize RL backdoor attacks without requiring model retraining. Our approach recasts backdoor detection as a planning problem, enabling systematic exploration of temporally extended trigger sequences while maintaining black-box access to the target policy. By leveraging the detection results, Plan2Cleanse can further achieve efficient mitigation through tree-search preventive replanning. We evaluated our method in competitive MuJoCo environments, simulated O-RAN wireless networks, and Atari games. Plan2Cleanse achieves substantial improvements, increasing trigger detection success rates by more than 61.4 percentage points in stealthy O-RAN scenarios and improving win rates from 35% to 53% in competitive Humanoid environments. These results demonstrate the effectiveness of our test-time defense approach and highlight the importance of proactive defenses against backdoor threats in RL deployments. Our implementation is publicly available at https://github.com/rl-bandits-lab/RL-Backdoor.
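To make the idea of treating trigger detection as a planning problem more concrete, the following is a minimal toy sketch of a UCT-style tree search over adversary action sequences. It is not the Plan2Cleanse implementation: the action set, the horizon, the hidden trigger `SECRET`, and the helper `victim_fails` are all hypothetical stand-ins, and the victim is queried only as a black box that reports whether it misbehaved after a given sequence.

```python
import math
import random

ACTIONS = (0, 1, 2)   # hypothetical adversary action set
HORIZON = 4           # hypothetical maximum sequence length to explore
SECRET = (2, 0, 1)    # toy backdoor trigger, hidden from the search


def victim_fails(seq):
    """Black-box query (toy): True if the victim misbehaves after `seq`."""
    n = len(SECRET)
    return any(tuple(seq[i:i + n]) == SECRET for i in range(len(seq) - n + 1))


class Node:
    def __init__(self):
        self.children = {}  # action -> Node
        self.visits = 0
        self.value = 0.0    # sum of rollout returns through this node


def uct_child(node, c=1.4):
    """Return the child action with the highest UCB1 score."""
    def score(a):
        ch = node.children[a]
        explore = c * math.sqrt(math.log(node.visits) / ch.visits)
        return ch.value / ch.visits + explore
    return max(node.children, key=score)


def search(iterations=2000, seed=0):
    """Run the tree search and return all triggering sequences it found."""
    rng = random.Random(seed)
    root = Node()
    found = set()
    for _ in range(iterations):
        node, seq, path = root, [], [root]
        # Selection: descend while the current node is fully expanded.
        while len(seq) < HORIZON and len(node.children) == len(ACTIONS):
            a = uct_child(node)
            node = node.children[a]
            seq.append(a)
            path.append(node)
        # Expansion: add one untried action below the selected node.
        if len(seq) < HORIZON:
            a = rng.choice([x for x in ACTIONS if x not in node.children])
            node.children[a] = Node()
            node = node.children[a]
            seq.append(a)
            path.append(node)
        # Rollout: complete the sequence randomly, then query the victim.
        rollout = seq + [rng.choice(ACTIONS) for _ in range(HORIZON - len(seq))]
        ret = 1.0 if victim_fails(rollout) else 0.0
        if ret:
            found.add(tuple(rollout))
        # Backpropagation: update statistics along the visited path.
        for n in path:
            n.visits += 1
            n.value += ret
    return sorted(found)


if __name__ == "__main__":
    print(search())
```

In this toy setting the reward is binary, so the tree statistics only begin to steer the search once some rollout has triggered the backdoor. A graded signal, such as a drop in the victim's return, would presumably give guidance earlier, but that choice is an assumption of this sketch rather than something stated in the abstract.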