🤖 AI Summary
This work studies robust policy learning in performative reinforcement learning (RL) under Huber's ε-contamination model, in which an ε-fraction of samples is arbitrarily corrupted by adversarial noise. Performative RL features policy-dependent environments: both the reward and transition dynamics adapt to the deployed policy. We introduce the ε-contamination framework into this setting for the first time, and propose a robust repeated retraining framework that integrates a problem-specific robust mean estimator for the gradients with policy regularization and a convex-concave Lagrangian formulation to ensure convergence. Theoretically, we prove that the algorithm converges to an O(√ε)-approximate performatively stable policy; empirically, it significantly outperforms non-robust baselines under contamination. Our key contribution is a robust statistical learning theory for performative RL that explicitly quantifies the trade-off between the contamination level ε and the convergence error.
📝 Abstract
In performative reinforcement learning (RL), an agent faces a policy-dependent environment: the reward and transition functions depend on the agent's policy. Prior work on performative RL has studied the convergence of repeated retraining approaches to a performatively stable policy. In the finite-sample regime, these approaches repeatedly solve for a saddle point of a convex-concave objective, which estimates the Lagrangian of a regularized version of the reinforcement learning problem. In this paper, we extend such repeated retraining approaches to operate on corrupted data. More specifically, we consider Huber's ε-contamination model, in which an ε-fraction of data points is corrupted by arbitrary adversarial noise. We propose a repeated retraining approach based on convex-concave optimization under corrupted gradients, together with a novel problem-specific robust mean estimator for the gradients. We prove that our approach exhibits last-iterate convergence to an approximately stable policy, with approximation error linear in √ε. We experimentally demonstrate the importance of accounting for corruption in performative reinforcement learning.
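To make the contamination model concrete, the sketch below simulates Huber ε-contamination of a batch of gradient samples and contrasts the naive empirical mean with a generic coordinate-wise trimmed mean. This is an illustrative toy, not the paper's problem-specific estimator, and all names (`contaminate`, `trimmed_mean`) and parameter choices are assumptions for the example:

```python
import numpy as np

rng = np.random.default_rng(0)

def contaminate(grads, eps, rng):
    """Huber eps-contamination: replace an eps-fraction of the
    gradient samples with arbitrary adversarial values (here, a
    large constant, but the model allows any corruption)."""
    n = len(grads)
    k = int(eps * n)
    idx = rng.choice(n, size=k, replace=False)
    corrupted = grads.copy()
    corrupted[idx] = 100.0  # adversarial outlier rows
    return corrupted

def trimmed_mean(samples, eps):
    """Generic coordinate-wise trimmed mean: drop the k largest and
    k smallest values in each coordinate, then average the rest."""
    n = len(samples)
    k = int(eps * n)
    s = np.sort(samples, axis=0)
    return s[k:n - k].mean(axis=0)

# Clean gradient samples concentrated around a true mean of 1.0
# in each of 4 coordinates.
clean = rng.normal(loc=1.0, scale=0.1, size=(1000, 4))
dirty = contaminate(clean, eps=0.1, rng=rng)

naive = dirty.mean(axis=0)           # dragged far from 1.0 by outliers
robust = trimmed_mean(dirty, 0.1)    # stays near 1.0
```

Running this, the naive mean is pulled roughly an order of magnitude away from the true gradient, while the trimmed estimate retains O(√ε)-scale error, which is why the retraining loop needs a robust estimator in place of the sample mean.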