Rethinking the Global Convergence of Softmax Policy Gradient with Linear Function Approximation

📅 2025-05-06
📈 Citations: 0
✨ Influential: 0
📄 PDF
🤖 AI Summary
This paper investigates the global convergence of Softmax Policy Gradient (SPG) with linear function approximation (Lin-SPG). Addressing the long-standing question of whether approximation error impedes global convergence, the authors establish, for the first time, that the asymptotic global convergence of Lin-SPG is *independent* of policy or value function approximation errors and is *entirely determined* by the geometric structure of the feature representation. They derive necessary and sufficient conditions on the feature matrix for global convergence and, based on these, provide asymptotically optimal convergence guarantees for both stochastic bandits and general Markov decision processes (MDPs). Specifically, Lin-SPG with a problem-specific learning rate achieves an $O(1/T)$ convergence rate after $T$ iterations, and with any arbitrary constant learning rate it still converges asymptotically to the globally optimal policy. This work delivers a fundamental theoretical characterization of, and a rigorous foundation for, policy gradient methods in the linear approximation setting.
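To make the setting concrete, the update studied here can be sketched on a stochastic bandit: logits are linear in the arm features, the policy is a softmax over those logits, and gradient ascent runs with a fixed learning rate. The feature matrix, reward vector, learning rate, and iteration count below are illustrative choices (not from the paper), and the exact expected-reward gradient is used instead of sampled rewards to keep the sketch deterministic.

```python
import numpy as np

def lin_spg(X, r, eta=1.0, T=10_000):
    """Run T exact-gradient Lin-SPG updates; return the final policy."""
    K, d = X.shape
    theta = np.zeros(d)
    for _ in range(T):
        z = X @ theta                      # logits are linear in the features
        z -= z.max()                       # stabilize the softmax
        pi = np.exp(z) / np.exp(z).sum()
        baseline = pi @ r                  # expected reward under pi
        # Exact policy gradient of E_pi[r]: X^T (pi * (r - baseline))
        theta += eta * (X.T @ (pi * (r - baseline)))
    z = X @ theta
    z -= z.max()
    return np.exp(z) / np.exp(z).sum()

# Three arms embedded in a 2-dimensional feature space (d < K), so the
# policy is genuinely under-parameterized rather than tabular.
X = np.array([[1.0, 0.0],
              [0.5, 0.5],
              [0.0, 1.0]])
r = np.array([0.2, 0.5, 0.9])

pi = lin_spg(X, r)
print(pi)  # mass concentrates on the highest-reward arm (index 2)
```

With these particular features the policy does concentrate on the optimal arm under a constant learning rate; the paper's contribution is characterizing exactly which feature matrices guarantee this.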

πŸ“ Abstract
Policy gradient (PG) methods have played an essential role in the empirical successes of reinforcement learning. In order to handle large state-action spaces, PG methods are typically used with function approximation. In this setting, the approximation error in modeling problem-dependent quantities is a key notion for characterizing the global convergence of PG methods. We focus on Softmax PG with linear function approximation (referred to as $\texttt{Lin-SPG}$) and demonstrate that the approximation error is irrelevant to the algorithm's global convergence even for the stochastic bandit setting. Consequently, we first identify the necessary and sufficient conditions on the feature representation that can guarantee the asymptotic global convergence of $\texttt{Lin-SPG}$. Under these feature conditions, we prove that $T$ iterations of $\texttt{Lin-SPG}$ with a problem-specific learning rate result in an $O(1/T)$ convergence to the optimal policy. Furthermore, we prove that $\texttt{Lin-SPG}$ with any arbitrary constant learning rate can ensure asymptotic global convergence to the optimal policy.
Problem

Research questions and friction points this paper is trying to address.

Analyzing global convergence of Softmax PG with linear approximation
Identifying feature conditions for asymptotic convergence of Lin-SPG
Proving O(1/T) convergence rate with problem-specific learning rate
Innovation

Methods, ideas, or system contributions that make the work stand out.

Softmax PG with linear function approximation
Approximation error shown to be irrelevant to global convergence
Optimal policy convergence with constant learning rate