Can In-Context Reinforcement Learning Recover From Reward Poisoning Attacks?

📅 2025-06-07
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work investigates the robustness of Decision Pre-trained Transformers (DPTs) in In-Context Reinforcement Learning (ICRL) against reward poisoning attacks. To address the limitation of existing methods—namely, their inability to withstand adaptive, learning-based attackers—we propose AT-DPT, the first adversarial training framework tailored for ICRL. AT-DPT jointly optimizes an attacker model and a policy model to enable robust action inference under poisoned reward contexts. Methodologically, we introduce a two-player game formulation into ICRL, establishing a Transformer-based adversarial training paradigm coupled with an online interactive evaluation framework. Experiments across bandit and Markov Decision Process (MDP) benchmarks demonstrate that AT-DPT significantly outperforms state-of-the-art robust RL algorithms. Moreover, its robustness generalizes to dynamic sequential decision-making settings, overcoming key constraints of conventional robust RL—such as restrictive attack assumptions and fixed environment configurations.
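The core idea, a learner trained to predict the true optimal action from poisoned in-context data, can be illustrated with a toy bandit sketch. Everything below is an assumption for illustration, not the paper's implementation: the real DPT is a transformer over interaction histories, here replaced by a linear map over per-arm empirical means; the paper learns its attacker jointly, here replaced by a fixed worst-case heuristic (`poison`) that spends its budget pushing the best arm down and the worst arm up. All names and hyperparameters (`context_means`, `EPS`, `H`) are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
K, EPS, H = 3, 1.0, 20  # arms, poisoning budget, per-task interactions

def poison(mu, eps=EPS):
    """Illustrative fixed attacker (the paper learns its attacker jointly):
    push the best arm's rewards down and the worst arm's up by the budget."""
    delta = np.zeros(K)
    delta[np.argmax(mu)] = -eps
    delta[np.argmin(mu)] = +eps
    return delta

def context_means(mu, delta):
    """Per-arm empirical means of poisoned rewards over H noisy pulls:
    a crude stand-in for the DPT's in-context interaction history."""
    return mu + delta + rng.normal(0.0, 0.1, size=(H, K)).mean(axis=0)

# Stand-in for the DPT: a linear map from poisoned context features to
# action logits, trained to predict the TRUE optimal arm (mirroring DPT's
# supervised pretraining objective, but on poisoned contexts).
W = np.zeros((K, K))
for _ in range(3000):
    mu = rng.normal(0.0, 1.0, K)        # sample a bandit task
    x = context_means(mu, poison(mu))   # poisoned in-context data
    logits = W @ x
    p = np.exp(logits - logits.max()); p /= p.sum()
    grad = p.copy(); grad[np.argmax(mu)] -= 1.0  # softmax cross-entropy grad
    W -= 0.1 * np.outer(grad, x)

# Evaluate: adversarially trained model vs. a naive "follow the poisoned
# empirical mean" learner, on fresh poisoned tasks.
hits_robust = hits_naive = 0
for _ in range(200):
    mu = rng.normal(0.0, 1.0, K)
    x = context_means(mu, poison(mu))
    hits_robust += int(np.argmax(W @ x) == np.argmax(mu))
    hits_naive += int(np.argmax(x) == np.argmax(mu))
acc_robust, acc_naive = hits_robust / 200, hits_naive / 200
print(acc_robust, acc_naive)
```

Because the attack is systematic, training on poisoned contexts lets even this linear stand-in pick up the inverted reward signal that a naive mean-follower misreads, which is the intuition behind training the DPT directly on attacked data.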

📝 Abstract
We study the corruption-robustness of in-context reinforcement learning (ICRL), focusing on the Decision-Pretrained Transformer (DPT, Lee et al., 2023). To address the challenge of reward poisoning attacks targeting the DPT, we propose a novel adversarial training framework, called Adversarially Trained Decision-Pretrained Transformer (AT-DPT). Our method simultaneously trains an attacker to minimize the true reward of the DPT by poisoning environment rewards, and a DPT model to infer optimal actions from the poisoned data. We evaluate the effectiveness of our approach against standard bandit algorithms, including robust baselines designed to handle reward contamination. Our results show that the proposed method significantly outperforms these baselines in bandit settings, under a learned attacker. We additionally evaluate AT-DPT on an adaptive attacker, and observe similar results. Furthermore, we extend our evaluation to the MDP setting, confirming that the robustness observed in bandit scenarios generalizes to more complex environments.
Problem

Research questions and friction points this paper is trying to address.

Study corruption-robustness of in-context reinforcement learning (ICRL).
Propose adversarial training framework to counter reward poisoning attacks.
Evaluate robustness in bandit and MDP settings against learned and adaptive attackers.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Adversarial training for reward poisoning defense
Simultaneous attacker and DPT model training
Robustness in bandit and MDP settings
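The "simultaneous attacker and DPT model training" idea can be sketched as an alternating two-player loop. This is a minimal skeleton under stated assumptions, not the paper's algorithm: the attacker here is a learned per-rank additive offset updated by a zeroth-order accept/reject rule, the learner the same linear stand-in as above, and all names (`attack`, `theta`, `EPS`) and update rules are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
K, EPS = 3, 0.5  # arms, poisoning budget

theta = np.zeros(K)   # attacker: learned additive offset per reward rank
W = np.zeros((K, K))  # learner: linear "DPT" stand-in

def attack(mu, th):
    """Rank-based poisoning: arm at rank r gets offset th[r], clipped to budget."""
    delta = np.empty(K)
    delta[np.argsort(mu)] = np.clip(th, -EPS, EPS)
    return mu + delta

def true_reward(mu, th):
    """Learner's expected TRUE reward when acting on poisoned means."""
    q = W @ attack(mu, th)
    q = np.exp(q - q.max()); q /= q.sum()
    return float(mu @ q)

for _ in range(500):
    mu = rng.normal(0.0, 1.0, K)
    x = attack(mu, theta)

    # Learner step: cross-entropy toward the true optimal arm, given poisoned data.
    logits = W @ x
    p = np.exp(logits - logits.max()); p /= p.sum()
    grad = p.copy(); grad[np.argmax(mu)] -= 1.0
    W -= 0.05 * np.outer(grad, x)

    # Attacker step (zeroth-order): keep a perturbed offset if it lowers
    # the learner's true reward on this task.
    cand = theta + rng.normal(0.0, 0.1, K)
    if true_reward(mu, cand) < true_reward(mu, theta):
        theta = cand

budget_used = np.abs(np.clip(theta, -EPS, EPS)).max()
print(budget_used, np.isfinite(W).all())
```

The alternation, attacker minimizing the learner's true reward while the learner fits optimal actions under the current attack, is the two-player game structure the summary describes; the clipping keeps the attack inside its budget.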