🤖 AI Summary
This paper studies *performative reinforcement learning*, in which the deployed policy alters the reward function and state-transition dynamics of the underlying MDP, rendering the standard regularized objective no longer strongly convex. For linear MDPs, including those with infinite state spaces, we propose the first scalable theoretical framework for performative stability. Methodologically, we introduce a novel recurrence relation among optimal dual variables into the convergence analysis, combining an empirical Lagrangian construction with primal-dual optimization. Our analysis yields a finite-sample error bound that depends only on the feature dimension, not on the state cardinality. Moreover, we provide the first saddle-point optimization guarantee for infinite-state linear MDPs under a bounded-coverage condition. Theoretically, our algorithm provably converges to a performatively stable policy, establishing a rigorous foundation for dynamic environments such as multi-agent systems.
📝 Abstract
We study the setting of \emph{performative reinforcement learning} where the deployed policy affects both the reward and the transition dynamics of the underlying Markov decision process. Prior work~\parencite{MTR23} has addressed this problem in the tabular setting and established last-iterate convergence of repeated retraining, with iteration complexity depending explicitly on the number of states. In this work, we generalize these results to \emph{linear Markov decision processes}, the primary theoretical model of large-scale MDPs. The main challenge with linear MDPs is that the regularized objective is no longer strongly convex, and we want a bound that scales with the dimension of the features rather than the number of states, which can be infinite. Our first result shows that repeatedly optimizing a regularized objective converges to a \emph{performatively stable policy}. In the absence of strong convexity, our analysis leverages a new recurrence relation that uses a specific linear combination of optimal dual solutions to prove convergence. We then tackle the finite-sample setting where the learner has access to a set of trajectories drawn from the current policy. We consider a reparametrized version of the primal problem and construct an empirical Lagrangian that is optimized from the samples. We show that, under a \emph{bounded coverage} condition, repeatedly solving for a saddle point of this empirical Lagrangian converges to a performatively stable solution, and we also construct a primal-dual algorithm that solves the empirical Lagrangian efficiently. Finally, we present several applications of the general framework of performative RL, including multi-agent systems.
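To make the retraining scheme concrete, here is a minimal sketch of the fixed-point formulation in illustrative notation (the symbols \(V_{\pi'}(\pi)\) and \(\pi_{\mathrm{PS}}\) are our own shorthand, not taken verbatim from the paper):

```latex
% Illustrative notation: V_{\pi'}(\pi) denotes the regularized value of a
% candidate policy \pi evaluated in the environment (reward and transitions)
% induced by deploying \pi'. A policy \pi_{\mathrm{PS}} is performatively
% stable if it is optimal for the environment it itself induces:
\[
  \pi_{\mathrm{PS}} \in \arg\max_{\pi} \, V_{\pi_{\mathrm{PS}}}(\pi).
\]
% Repeated retraining re-optimizes the regularized objective against the
% environment induced by the previously deployed policy:
\[
  \pi_{t+1} \in \arg\max_{\pi} \, V_{\pi_t}(\pi),
\]
% and the paper's convergence results bound how fast this recursion
% approaches \pi_{\mathrm{PS}}, with rates in terms of the feature
% dimension rather than the number of states.
```

The recursion makes the non-strong-convexity difficulty visible: each step changes the objective itself, so convergence cannot be argued from a single fixed optimization landscape.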