🤖 AI Summary
This paper investigates the global convergence of policy gradient (PG) methods in finite-armed bandits under linear function approximation. It identifies a limitation of conventional convergence analyses, namely their reliance on approximation error, and shows that convergence instead hinges on whether the policy update and the function representation preserve the relative ordering of action rewards. The work establishes, for the first time, necessary and sufficient convergence conditions based on reward-order preservation: natural policy gradient (NPG) converges globally if and only if the linear projection preserves the rank of the optimal action; Softmax PG additionally requires a non-domination condition together with order preservation. These results are proved via projection geometry and policy gradient theory, and validated empirically. The core contribution is a reformulation of PG convergence criteria that decouples representational capacity from optimization dynamics, providing a new theoretical foundation for trustworthy function-approximation-based reinforcement learning.
📝 Abstract
We prove that, for finite-arm bandits with linear function approximation, the global convergence of policy gradient (PG) methods depends on interrelated properties between the policy update and the representation. **First**, we establish a few key observations that frame the study: **(i)** Global convergence can be achieved under linear function approximation without policy or reward realizability, both for the standard Softmax PG and natural policy gradient (NPG). **(ii)** Approximation error is not a key quantity for characterizing global convergence in either algorithm. **(iii)** The conditions on the representation that imply global convergence are different between these two algorithms. Overall, these observations call into question approximation error as an appropriate quantity for characterizing the global convergence of PG methods under linear function approximation. **Second**, motivated by these observations, we establish new general results: **(i)** NPG with linear function approximation achieves global convergence *if and only if* the projection of the reward onto the representable space preserves the optimal action's rank, a quantity that is not strongly related to approximation error. **(ii)** The global convergence of Softmax PG occurs if the representation satisfies a non-domination condition and can preserve the ranking of rewards, which goes well beyond policy or reward realizability. We provide experimental results to support these theoretical findings.
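The NPG condition above is a concrete, checkable property: project the reward vector onto the span of the feature matrix and ask whether the optimal arm stays optimal. A minimal sketch of that check, with a made-up feature matrix `Phi` and reward vector `r` (both hypothetical, not from the paper's experiments):

```python
import numpy as np

# Hypothetical 4-arm example with a 2-dimensional linear representation.
# Phi has one feature row per arm; r holds the true mean rewards.
Phi = np.array([[1.0, 0.0],
                [0.5, 0.5],
                [0.0, 1.0],
                [0.3, 0.2]])
r = np.array([1.0, 0.8, 0.6, 0.2])

# Orthogonal projection of r onto col(Phi): fit least-squares parameters
# via the pseudoinverse, then map back to per-arm projected rewards.
r_proj = Phi @ (np.linalg.pinv(Phi) @ r)

# The order-preservation condition discussed above: the projection must
# keep the optimal action's rank (here, simply the argmax position).
preserves_optimum = np.argmax(r_proj) == np.argmax(r)
print(preserves_optimum)
```

Note that `r_proj` can differ substantially from `r` (large approximation error) while the argmax, and hence this condition, is still preserved, which is the sense in which the criterion decouples from approximation error.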