Global linear convergence of entropy-regularized softmax policy gradient beyond tabular MDPs

📅 2026-05-24

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work establishes global convergence of entropy-regularized policy gradient methods in infinite-horizon Markov decision processes with continuous state and action spaces. Using log-linear softmax policies with linear function approximation, it extends the global linear convergence guarantees known for tabular MDPs to the continuous setting. The authors first derive a non-uniform Polyak–Łojasiewicz inequality under realizability of the regularized state-action value function; its constant depends on the Fisher information matrix or on the uncentered feature covariance matrix. They then show that this constant stays bounded away from zero along the gradient flow under two structural conditions on the features (full-affine-span features and simplex-valued features), using radial unboundedness of the KL regularizer. As a result, the suboptimality of the regularized objective along the policy gradient flow decays exponentially, at a rate of $\mathcal{O}(e^{-Ct})$ for some $C>0$.
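To make the two matrices named in the summary concrete, here is a minimal sketch for a single fixed state with a finite set of actions. The finite action set, the random feature matrix, and the function names `softmax_policy` and `policy_geometry` are assumptions made for illustration; the paper works with continuous state and action spaces, and its Fisher information matrix may also average over a state distribution. The sketch relies only on the standard fact that for a log-linear softmax policy the score is the feature minus its mean under the policy, so the Fisher information matrix equals the centered feature covariance.

```python
import numpy as np


def softmax_policy(theta, features):
    """Log-linear softmax policy for one fixed state.

    features has shape (num_actions, d); row a is the feature vector phi(a).
    Returns probabilities proportional to exp(theta . phi(a)).
    """
    logits = features @ theta
    logits = logits - logits.max()  # shift for numerical stability
    weights = np.exp(logits)
    return weights / weights.sum()


def policy_geometry(theta, features):
    """Return the policy, its Fisher matrix, and the uncentered feature covariance."""
    probs = softmax_policy(theta, features)
    mean_feature = probs @ features
    # Uncentered covariance: E_pi[phi phi^T].
    uncentered = features.T @ (probs[:, None] * features)
    # Fisher information: Cov_pi(phi) = E_pi[phi phi^T] - E_pi[phi] E_pi[phi]^T.
    fisher = uncentered - np.outer(mean_feature, mean_feature)
    return probs, fisher, uncentered


# Hypothetical example: 6 actions with 3-dimensional random features.
rng = np.random.default_rng(0)
features = rng.normal(size=(6, 3))
theta = rng.normal(size=3)

probs, fisher, uncentered = policy_geometry(theta, features)
min_eig_fisher = np.linalg.eigvalsh(fisher).min()
min_eig_uncentered = np.linalg.eigvalsh(uncentered).min()
print("smallest Fisher eigenvalue:", min_eig_fisher)
print("smallest uncentered-covariance eigenvalue:", min_eig_uncentered)
```

Because the uncentered covariance equals the Fisher matrix plus a rank-one positive semidefinite term, its smallest eigenvalue is never below that of the Fisher matrix. In this toy setting both eigenvalues shrink as the policy concentrates on a few actions, which is one way to picture the degeneracy behind the non-uniform constant.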

📝 Abstract

We study the global convergence of policy gradient for infinite-horizon entropy-regularized Markov decision processes (MDPs) with continuous state and action spaces. We consider log-linear softmax policies with linear function approximation, which extend the tabular softmax parameterization while retaining a tractable policy class. Under $Q^{\pi}_{\tau}$-realizability for the regularized state-action value function, we first establish a non-uniform Polyak–Łojasiewicz (PŁ) inequality. The non-uniformity arises through degeneracy of constants associated with the policy geometry, namely the Fisher information matrix or an uncentered feature covariance matrix. We then identify two feature regimes under which this non-uniform constant can be bounded along the gradient flow. For full-affine-span features, we prove radial unboundedness of the KL regularizer and show that the smallest eigenvalue of the Fisher information matrix remains bounded below by an initialization-dependent positive constant. For simplex-valued features, we prove an analogous radial unboundedness result in the subspace orthogonal to the all-ones vector and obtain a uniform lower bound for the smallest eigenvalue of the uncentered covariance matrix. These results imply global linear convergence of the regularized objective along the gradient flow, i.e., suboptimality decaying as $\mathcal{O}(e^{-Ct})$ for some $C>0$. Our analysis extends the global convergence theory of entropy-regularized softmax policy gradient beyond the tabular setting of Agarwal et al. (2020), Bhandari and Russo (2024), and Mei et al. (2020).
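The step from a PŁ inequality to an exponential rate can be sketched with a standard Grönwall argument. The notation below is assumed for illustration and is not taken from the paper: $J_\tau$ denotes the regularized objective, $J_\tau^\star$ its optimal value, $\theta_t$ the gradient flow, and $c(\theta)$ the non-uniform PŁ constant. The exact form and constants of the paper's inequality may differ.

```latex
\begin{align*}
  &\text{Gradient flow:} && \dot\theta_t = \nabla J_\tau(\theta_t), \\
  &\text{Non-uniform P\L{} inequality:} && \|\nabla J_\tau(\theta)\|^2 \ge c(\theta)\,\bigl(J_\tau^\star - J_\tau(\theta)\bigr), \\
  &\text{Bound along the flow:} && c(\theta_t) \ge C > 0 \quad \text{for all } t \ge 0, \\
  &\text{Hence:} && \frac{d}{dt}\bigl(J_\tau^\star - J_\tau(\theta_t)\bigr)
      = -\|\nabla J_\tau(\theta_t)\|^2
      \le -C\,\bigl(J_\tau^\star - J_\tau(\theta_t)\bigr), \\
  &\text{Gr\"onwall:} && J_\tau^\star - J_\tau(\theta_t)
      \le e^{-Ct}\,\bigl(J_\tau^\star - J_\tau(\theta_0)\bigr).
\end{align*}
```

Under this reading, the role of the two feature regimes is to supply the third line: a positive lower bound on the PŁ constant that holds along the whole trajectory.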

Problem

Research questions and friction points this paper is trying to address.

policy gradient

entropy regularization

global convergence

continuous MDPs

softmax policy

Innovation

Methods, ideas, or system contributions that make the work stand out.

entropy-regularized policy gradient

global linear convergence

continuous MDPs