Performative Reinforcement Learning with Linear Markov Decision Process

📅 2024-11-07
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
🤖 AI Summary
This paper addresses the problem of *performative stability*: a deployed policy dynamically alters the reward function and state-transition dynamics of the underlying MDP, rendering the standard regularized objective no longer strongly convex. For linear MDPs, including those with infinite state spaces, the paper proposes the first scalable theoretical framework for performative stability. Methodologically, it introduces a novel recurrence relation among optimal dual variables into the convergence analysis, combining an empirical Lagrangian construction with primal-dual optimization. The analysis yields a finite-sample error bound that depends only on the feature dimension, not on the number of states, and provides the first saddle-point optimization guarantee for infinite-state linear MDPs under a bounded-coverage condition. The algorithm is proven to converge to a performatively stable policy, establishing a rigorous foundation for dynamic environments such as multi-agent systems.
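As a toy illustration of the repeated-retraining idea summarized above (our own construction, not the paper's algorithm): deploying a policy shifts the environment's reward parameter, and each round re-solves a regularized objective against the environment induced by the previous policy. Under a contraction condition, the iterates converge to a performatively stable point. The quadratic objective and all names below are hypothetical simplifications.

```python
import numpy as np

# Toy sketch of repeated retraining in a performative setting.
# The environment's reward parameter r(theta) shifts with the deployed
# policy parameters theta; each round re-solves a regularized objective
# against the environment induced by the previous deployment.

rng = np.random.default_rng(0)
d = 4                                # feature dimension (linear-MDP analogue)
A = 0.1 * rng.normal(size=(d, d))    # sensitivity: how deployment shifts rewards
r0 = rng.normal(size=d)
lam = 1.0                            # regularization strength

def reward_param(theta):
    """Environment's reward parameter after deploying policy theta."""
    return r0 + A @ theta

def retrain(theta_prev):
    """Maximize r(theta_prev)^T theta - (lam/2)||theta||^2 in closed form."""
    return reward_param(theta_prev) / lam

theta = np.zeros(d)
for _ in range(100):
    theta = retrain(theta)           # repeated retraining

# A performatively stable point solves theta = (r0 + A theta) / lam,
# i.e. (lam * I - A) theta = r0; the iterates contract toward it when
# ||A|| / lam < 1.
theta_star = np.linalg.solve(lam * np.eye(d) - A, r0)
print(np.allclose(theta, theta_star, atol=1e-6))
```

The key point, mirrored by the paper's setting, is that the retraining map is a fixed-point iteration: stability means the policy is optimal for the very environment its own deployment induces.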

📝 Abstract
We study the setting of *performative reinforcement learning*, where the deployed policy affects both the reward and the transitions of the underlying Markov decision process. Prior work [MTR23] has addressed this problem in the tabular setting and established last-iterate convergence of repeated retraining with iteration complexity depending explicitly on the number of states. In this work, we generalize these results to *linear Markov decision processes*, the primary theoretical model of large-scale MDPs. The main challenge with linear MDPs is that the regularized objective is no longer strongly convex, and we want a bound that scales with the dimension of the features rather than the number of states, which can be infinite. Our first result shows that repeatedly optimizing a regularized objective converges to a *performatively stable policy*. In the absence of strong convexity, our analysis leverages a new recurrence relation that uses a specific linear combination of optimal dual solutions to prove convergence. We then tackle the finite-sample setting, where the learner has access to a set of trajectories drawn from the current policy. We consider a reparametrized version of the primal problem and construct an empirical Lagrangian to be optimized from the samples. We show that, under a *bounded coverage* condition, repeatedly solving for a saddle point of this empirical Lagrangian converges to a performatively stable solution, and we also construct a primal-dual algorithm that solves the empirical Lagrangian efficiently. Finally, we show several applications of the general framework of performative RL, including multi-agent systems.
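The primal-dual (gradient descent-ascent) template the abstract alludes to can be sketched on a small saddle-point problem. The quadratic Lagrangian below is our own toy stand-in for the paper's empirical Lagrangian, and all names are hypothetical.

```python
import numpy as np

# Hedged sketch of primal-dual optimization: find a saddle point of
#   L(x, y) = 0.5 ||x||^2 + y^T (b - C x) - (mu/2) ||y||^2
# by alternating a gradient-descent step on the primal variable x with a
# gradient-ascent step on the dual variable y.

rng = np.random.default_rng(1)
d, m = 5, 2
C = 0.5 * rng.normal(size=(m, d))   # constraint/coupling matrix
b = rng.normal(size=m)
mu = 0.1                            # dual regularization (keeps L strongly concave in y)

def grads(x, y):
    gx = x - C.T @ y                # nabla_x L
    gy = b - C @ x - mu * y         # nabla_y L
    return gx, gy

x, y = np.zeros(d), np.zeros(m)
eta = 0.01                          # small step size for stability
for _ in range(50_000):
    gx, gy = grads(x, y)
    x = x - eta * gx                # primal descent
    y = y + eta * gy                # dual ascent

# At the saddle point both gradients vanish.
gx, gy = grads(x, y)
print(np.linalg.norm(gx) < 1e-6, np.linalg.norm(gy) < 1e-6)
```

In the paper's setting, the Lagrangian is instead built from sampled trajectories and the analysis must work without strong convexity of the primal objective; the alternating descent/ascent structure is the part this sketch illustrates.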
Problem

Research questions and friction points this paper is trying to address.

Study performative reinforcement learning with linear MDPs.
Address convergence without strong convexity in large-scale MDPs.
Develop efficient algorithms for performatively stable policies.
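The linear-MDP assumption behind these questions can be illustrated with a tiny synthetic example (ours, not from the paper): rewards and transitions are linear in a d-dimensional feature map, so quantities of interest depend on d even when the state space is huge.

```python
import numpy as np

# Minimal illustration of the linear-MDP structure: r(s, a) = <phi(s, a), theta_r>
# and P(.|s, a) = sum_i phi_i(s, a) * mu_i, where each mu_i is a measure over
# next states. All sizes and names here are hypothetical.

rng = np.random.default_rng(2)
S, A_n, d = 6, 3, 4                             # states, actions, feature dimension

phi = rng.dirichlet(np.ones(d), size=(S, A_n))  # features phi(s, a) in the simplex
theta_r = rng.uniform(size=d)                   # reward parameter
mu = rng.dirichlet(np.ones(S), size=d)          # mu[i] is a distribution over s'

def reward(s, a):
    return phi[s, a] @ theta_r                  # reward is linear in the features

def transition(s, a):
    return phi[s, a] @ mu                       # next-state distribution, linear in phi

# Because each phi(s, a) lies in the simplex and each mu[i] is a probability
# distribution, every P(.|s, a) is a valid probability distribution.
P = phi @ mu                                    # shape (S, A_n, S)
print(np.allclose(P.sum(axis=-1), 1.0))
```

This is why a bound scaling with d rather than S matters: the learner only ever estimates d-dimensional objects like theta_r and mu, never an S-by-S table.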
Innovation

Methods, ideas, or system contributions that make the work stand out.

Generalizes performative RL to linear MDPs.
Uses an empirical Lagrangian for sample-based optimization.
Develops a primal-dual algorithm for efficient convergence.