🤖 AI Summary
This work investigates the robustness of the Decision-Pretrained Transformer (DPT) for In-Context Reinforcement Learning (ICRL) against reward poisoning attacks. To address a key limitation of existing methods, namely their inability to withstand adaptive, learning-based attackers, we propose AT-DPT, the first adversarial training framework tailored to ICRL. AT-DPT jointly optimizes an attacker model and a policy model so that the policy can infer robust actions from poisoned reward contexts. Methodologically, we introduce a two-player game formulation into ICRL, establishing a Transformer-based adversarial training paradigm coupled with an online interactive evaluation framework. Experiments on bandit and Markov Decision Process (MDP) benchmarks demonstrate that AT-DPT significantly outperforms state-of-the-art robust RL algorithms. Moreover, its robustness generalizes to dynamic sequential decision-making settings, relaxing key constraints of conventional robust RL, such as restrictive attack assumptions and fixed environment configurations.
📝 Abstract
We study the corruption robustness of in-context reinforcement learning (ICRL), focusing on the Decision-Pretrained Transformer (DPT; Lee et al., 2023). To address reward poisoning attacks targeting the DPT, we propose a novel adversarial training framework, called the Adversarially Trained Decision-Pretrained Transformer (AT-DPT). Our method simultaneously trains an attacker to minimize the true reward collected by the DPT by poisoning environment rewards, and a DPT model to infer optimal actions from the poisoned data. We evaluate the effectiveness of our approach against standard bandit algorithms, including robust baselines designed to handle reward contamination. Our results show that the proposed method significantly outperforms these baselines in bandit settings under a learned attacker. We additionally evaluate AT-DPT against an adaptive attacker and observe similar results. Furthermore, we extend our evaluation to the MDP setting, confirming that the robustness observed in bandit scenarios generalizes to more complex environments.
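The joint optimization described above can be sketched in a toy bandit setting: a simple linear learner stands in for the DPT and is trained to infer the optimal arm from poisoned mean rewards, while an attacker adapts an additive perturbation, clipped to an L-infinity budget, to maximize the learner's loss. This is a minimal illustrative sketch, not the paper's architecture or training procedure; the linear learner, the budget `eps`, and the learning rates are all assumptions made for illustration.

```python
import numpy as np

# Toy two-player game: a linear learner (stand-in for the DPT) vs. a
# reward-poisoning attacker. Illustrative only; hyperparameters assumed.
rng = np.random.default_rng(0)
K, eps = 3, 0.5                      # number of arms; attacker's L-inf budget
W = np.eye(K)                        # learner: action logits = W @ r + b
b = np.zeros(K)
delta = np.zeros(K)                  # attacker's additive reward perturbation

def softmax(z):
    z = z - z.max()                  # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

for step in range(3000):
    mu = rng.uniform(0.0, 1.0, size=K)        # a fresh bandit instance
    target = np.argmax(mu)                     # optimal action (supervision)
    r = mu + delta                             # rewards the learner observes
    p = softmax(W @ r + b)
    grad_logits = p.copy()
    grad_logits[target] -= 1.0                 # d(cross-entropy)/d(logits)
    # Learner step: minimize the loss (infer optimal action from poisoned data)
    W -= 0.3 * np.outer(grad_logits, r)
    b -= 0.3 * grad_logits
    # Attacker step: maximize the same loss by poisoning, within the budget
    delta = np.clip(delta + 0.05 * (W.T @ grad_logits), -eps, eps)

# Evaluate: how often the trained learner still picks the truly best arm
# when its observed rewards carry the attacker's final perturbation.
hits = 0
for _ in range(500):
    mu = rng.uniform(0.0, 1.0, size=K)
    hits += np.argmax(W @ (mu + delta) + b) == np.argmax(mu)
acc = hits / 500
```

The alternating gradient updates mirror the min-max structure: the attacker ascends on the learner's loss while staying inside its poisoning budget, and the learner descends on the same loss so that, at convergence, it compensates for the perturbation and recovers the truly optimal action.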