Revisiting Policy Gradients for Restricted Policy Classes: Escaping Myopic Local Optima with $k$-step Policy Gradients

📅 2026-05-11
📈 Citations: 0
Influential: 0
🤖 AI Summary
Standard policy gradient methods often converge to suboptimal stationary points within restricted policy classes because each update relies only on the one-step Q-function. This work proposes a generalized k-step policy gradient method that overcomes this myopia by coupling the randomness across a k-step time window. Theoretically, the method converges to a solution whose performance gap to the optimal deterministic policy shrinks exponentially in k, assuming only smoothness and differentiability of the value function. Combined with projected gradient descent or mirror descent, it attains this guarantee in O(1/T) iterations, with bounds that avoid distribution mismatch factors, and it applies to settings such as state aggregation and partially observable cooperative multi-agent coordination.
📝 Abstract
This work revisits standard policy gradient methods used on restricted policy classes, which are known to get stuck in suboptimal critical points. We identify an important cause of this phenomenon: the policy gradient is itself fundamentally myopic, i.e., it only improves the policy based on the one-step $Q$-function. We propose a generalized $k$-step policy gradient method that couples the randomness within a $k$-step time window and can escape the myopic local optima in MDPs with restricted policy classes. We show that this new method is theoretically guaranteed to converge to a solution that is exponentially close in performance to the optimal deterministic policy with respect to $k$. Further, we show that projected gradient descent and mirror descent with this $k$-step policy gradient can achieve this exponential guarantee in $O(1/T)$ iterations, despite only assuming smoothness and differentiability of the value function. This provides near-optimal solutions to previously elusive applications such as state aggregation and partially observable cooperative multi-agent settings. Moreover, our bounds avoid the ubiquitous distribution mismatch factors $\|d_\mu^{\pi^*} / d_\mu^{\pi}\|_\infty$ and $\|d_\mu^{\pi^*} / \mu\|_\infty$, enabling the $k$-step policy gradient method to escape suboptimal critical points that emerge from poor exploration in fully observable settings.
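To make the "one-step $Q$-function" remark concrete, here is a minimal sketch of the standard (one-step) policy gradient for a tabular softmax policy, checked against finite differences. It is not the paper's $k$-step method, which the abstract does not specify. The two-state MDP, its numbers, and all function names are hypothetical and chosen only for illustration.

```python
import math

GAMMA = 0.9
# Hypothetical 2-state, 2-action MDP (illustrative numbers, not from the paper).
P = [[[0.9, 0.1], [0.2, 0.8]],   # P[s][a][t] = Pr(next state t | s, a)
     [[0.7, 0.3], [0.1, 0.9]]]
R = [[0.0, 1.0], [0.5, 2.0]]     # R[s][a] = immediate reward
MU = [0.5, 0.5]                  # initial state distribution mu


def softmax_policy(theta):
    """Turn per-state logits theta[s][a] into action probabilities pi[s][a]."""
    pi = []
    for logits in theta:
        m = max(logits)
        exps = [math.exp(x - m) for x in logits]
        z = sum(exps)
        pi.append([e / z for e in exps])
    return pi


def evaluate(pi, iters=2000):
    """Iterative policy evaluation; returns V[s] and the one-step Q[s][a]."""
    V = [0.0, 0.0]
    for _ in range(iters):
        V = [sum(pi[s][a] * (R[s][a] + GAMMA * sum(P[s][a][t] * V[t] for t in range(2)))
                 for a in range(2)) for s in range(2)]
    Q = [[R[s][a] + GAMMA * sum(P[s][a][t] * V[t] for t in range(2))
          for a in range(2)] for s in range(2)]
    return V, Q


def visitation(pi, iters=2000):
    """Discounted visitation d(s) = (1 - gamma) * sum_t gamma^t Pr(s_t = s)."""
    d, p, w = [0.0, 0.0], MU[:], 1.0 - GAMMA
    for _ in range(iters):
        d = [d[s] + w * p[s] for s in range(2)]
        p = [sum(p[s] * pi[s][a] * P[s][a][t] for s in range(2) for a in range(2))
             for t in range(2)]
        w *= GAMMA
    return d


def policy_gradient(theta):
    """Softmax policy gradient: d(s) * pi(a|s) * (Q(s,a) - V(s)) / (1 - gamma).

    Each entry depends on the policy only through the one-step Q-function.
    """
    pi = softmax_policy(theta)
    V, Q = evaluate(pi)
    d = visitation(pi)
    return [[d[s] * pi[s][a] * (Q[s][a] - V[s]) / (1.0 - GAMMA)
             for a in range(2)] for s in range(2)]


def objective(theta):
    """Expected discounted return V(mu) under the softmax policy."""
    V, _ = evaluate(softmax_policy(theta))
    return sum(MU[s] * V[s] for s in range(2))


def finite_difference(theta, eps=1e-5):
    """Central-difference estimate of the gradient of objective(theta)."""
    grad = [[0.0, 0.0], [0.0, 0.0]]
    for s in range(2):
        for a in range(2):
            up = [row[:] for row in theta]
            dn = [row[:] for row in theta]
            up[s][a] += eps
            dn[s][a] -= eps
            grad[s][a] = (objective(up) - objective(dn)) / (2.0 * eps)
    return grad
```

Calling `policy_gradient([[0.3, -0.2], [0.1, 0.4]])` and `finite_difference` on the same logits should agree to within numerical tolerance. In this unrestricted tabular case the gradient is well behaved; the abstract's concern is what happens when the policy class is restricted, for example when several states must share one set of logits.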
Problem

Research questions and friction points this paper is trying to address.

policy gradient
restricted policy classes
myopic local optima
suboptimal critical points
Markov decision processes
Innovation

Methods, ideas, or system contributions that make the work stand out.

k-step policy gradient
myopic local optima
restricted policy classes
distribution mismatch
exponential convergence