🤖 AI Summary
This paper investigates the global convergence of policy gradient (PG) methods in finite-armed bandits under linear function approximation. It identifies a limitation of conventional convergence analyses, namely their reliance on approximation error, and shows that convergence instead hinges on whether the policy update and the function representation preserve the relative ordering of action rewards. The work establishes, for the first time, necessary and sufficient convergence conditions based on reward-order preservation: natural policy gradient (NPG) converges globally if and only if the linear projection preserves the rank of the optimal action; Softmax PG additionally requires a non-domination condition together with order preservation. These results are proved via projection geometry and policy gradient theory, and validated empirically. The core contribution is a reformulation of PG convergence criteria that decouples representational capacity from optimization dynamics, providing a new theoretical foundation for trustworthy function-approximation-based reinforcement learning.
📝 Abstract
We prove that, for finite-arm bandits with linear function approximation, the global convergence of policy gradient (PG) methods depends on interrelated properties between the policy update and the representation. **First**, we establish a few key observations that frame the study: **(i)** Global convergence can be achieved under linear function approximation without policy or reward realizability, both for the standard Softmax PG and natural policy gradient (NPG). **(ii)** Approximation error is not a key quantity for characterizing global convergence in either algorithm. **(iii)** The conditions on the representation that imply global convergence are different between these two algorithms. Overall, these observations call into question approximation error as an appropriate quantity for characterizing the global convergence of PG methods under linear function approximation. **Second**, motivated by these observations, we establish new general results: **(i)** NPG with linear function approximation achieves global convergence *if and only if* the projection of the reward onto the representable space preserves the optimal action's rank, a quantity that is not strongly related to approximation error. **(ii)** The global convergence of Softmax PG occurs if the representation satisfies a non-domination condition and can preserve the ranking of rewards, which goes well beyond policy or reward realizability. We provide experimental results to support these theoretical findings.
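The NPG condition above is a concrete, checkable property: project the reward vector onto the span of the feature matrix and ask whether the optimal arm stays optimal. A minimal sketch of that check, with a made-up feature matrix `Phi` and reward vector `r` (both hypothetical, not from the paper's experiments):

```python
import numpy as np

# Hypothetical 4-arm example with a 2-dimensional linear representation.
# Phi has one feature row per arm; r holds the true mean rewards.
Phi = np.array([[1.0, 0.0],
                [0.5, 0.5],
                [0.0, 1.0],
                [0.3, 0.2]])
r = np.array([1.0, 0.8, 0.6, 0.2])

# Orthogonal projection of r onto col(Phi): fit least-squares parameters
# via the pseudoinverse, then map back to per-arm projected rewards.
r_proj = Phi @ (np.linalg.pinv(Phi) @ r)

# The order-preservation condition discussed above: the projection must
# keep the optimal action's rank (here, simply the argmax position).
preserves_optimum = np.argmax(r_proj) == np.argmax(r)
print(preserves_optimum)
```

Note that `r_proj` can differ substantially from `r` (large approximation error) while the argmax, and hence this condition, is still preserved, which is the sense in which the criterion decouples from approximation error.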