🤖 AI Summary
This work addresses the lack of global convergence theory for Wasserstein policy gradient (WPG) methods in entropy-regularized reinforcement learning by introducing an analytical framework grounded in the Bellman structure. It leverages a Polyak–Łojasiewicz-type geometric condition induced by the soft Bellman recursion, combining KL-divergence-based Bellman residuals, Bellman contraction, Fisher information, and uniform log-Sobolev inequalities, and it controls the regularity and discretization errors that arise along the policy iterates. For the first time, the study establishes global convergence guarantees for WPG on the non-convex entropy-regularized objective over continuous action spaces and proves geometric convergence up to a discretization bias, revealing the favorable optimization geometry inherent in entropy-regularized reinforcement learning.
📝 Abstract
Wasserstein policy gradient (WPG) is a policy optimization method for reinforcement learning (RL) that exploits the optimal-transport geometry of action distributions. For the entropy-regularized RL objective, WPG evolves each state-conditional policy by transporting it along the action gradient of the soft Q-function together with a Langevin-type diffusion. Despite its appeal for continuous-control problems, its global convergence properties remain poorly understood. Standard Langevin analyses do not directly apply, because the RL objective depends on the policy through the Bellman recursion rather than through a static convex functional, and the Langevin drift is determined by the soft Q-function, whose regularity must be controlled along the policy iterates.
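The transport-plus-diffusion update described above can be illustrated with a minimal particle sketch for a single state. This is not the paper's algorithm: the soft Q-function is frozen to a toy quadratic instead of being updated through the Bellman recursion, and the names `grad_q`, `wpg_particle_step`, `run`, `step`, and `tau` are hypothetical. Each particle moves along the action gradient of Q and receives Gaussian noise of scale sqrt(2 * step * tau), so the particles approach the Gibbs density proportional to exp(Q / tau), up to a step-size-dependent bias.

```python
import math
import random


def grad_q(a, mu=1.0):
    """Action gradient of the toy soft Q-function Q(a) = -0.5 * (a - mu)**2 (an assumption)."""
    return -(a - mu)


def wpg_particle_step(particles, step, tau, rng):
    """One discretized update: drift along grad_a Q plus Langevin-type Gaussian noise."""
    noise_scale = math.sqrt(2.0 * step * tau)
    return [a + step * grad_q(a) + noise_scale * rng.gauss(0.0, 1.0) for a in particles]


def run(n_particles=2000, n_steps=200, step=0.1, tau=0.5, seed=0):
    """Evolve action particles for one fixed state and return their empirical mean and variance."""
    rng = random.Random(seed)
    particles = [rng.gauss(0.0, 1.0) for _ in range(n_particles)]
    for _ in range(n_steps):
        particles = wpg_particle_step(particles, step, tau, rng)
    mean = sum(particles) / n_particles
    var = sum((a - mean) ** 2 for a in particles) / n_particles
    return mean, var


mean, var = run()
# For this toy Q the Gibbs policy is Gaussian with mean mu = 1.0 and variance tau = 0.5.
# The discretized chain instead settles at variance tau / (1 - step / 2), a small
# example of the discretization bias mentioned in the text.
print(mean, var)
```

In this toy setting the stationary variance of the discretized chain can be computed in closed form, which makes the gap between the continuous-time target and the discretized iterates visible. In the full method the drift changes with the policy, which is why the regularity of the soft Q-function has to be controlled along the iterates.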
In this paper, we develop a global convergence theory for WPG by exploiting the Bellman structure of entropy-regularized RL. We show that the role usually played by convexity can be replaced by a Bellman-based argument: the soft Bellman residual admits a statewise KL representation with respect to a Gibbs policy; Bellman contraction relates this residual to the global optimality gap; and a Bellman resolvent identity connects value improvement to relative Fisher information. Combined with a uniform log-Sobolev inequality (LSI) for the evolving Gibbs family, these ingredients yield a distributional Polyak–Łojasiewicz (PL) condition. We further establish the regularity and uniform bounds needed to control the discretization error, thereby obtaining geometric contraction up to a discretization bias. Conceptually, our analysis shows that although entropy-regularized RL is not convex in the usual flat sense, the Bellman recursion induces a favorable PL-type geometry that supports global convergence of WPG.
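As a hedged illustration of the statewise KL representation and the LSI step, the identities below are written in standard notation that is assumed here, not taken from the paper: temperature $\tau$, soft Q-function $Q^{\pi}$, soft value $V^{\pi}(s) = \mathbb{E}_{a \sim \pi(\cdot \mid s)}[Q^{\pi}(s,a) - \tau \log \pi(a \mid s)]$, Gibbs policy $\pi_{Q}$, and LSI constant $\rho$. The paper's exact definitions, normalizations, and constants may differ.

```latex
% Gibbs policy induced by a soft Q-function Q (assumed notation)
\pi_{Q}(a \mid s)
  = \frac{\exp\big(Q(s,a)/\tau\big)}{\int \exp\big(Q(s,a')/\tau\big)\,da'}

% Statewise KL form of the soft Bellman residual:
% log-partition value minus the soft value of the current policy
\tau \log \int \exp\big(Q^{\pi}(s,a)/\tau\big)\,da \;-\; V^{\pi}(s)
  = \tau\,\mathrm{KL}\big(\pi(\cdot \mid s)\,\big\|\,\pi_{Q^{\pi}}(\cdot \mid s)\big)

% Log-Sobolev inequality for the Gibbs policy (one common normalization):
% KL is bounded by the relative Fisher information
\mathrm{KL}\big(\mu \,\big\|\, \pi_{Q}(\cdot \mid s)\big)
  \le \frac{1}{2\rho} \int \Big\| \nabla_{a} \log \frac{\mu(a)}{\pi_{Q}(a \mid s)} \Big\|^{2} \mu(a)\,da
```

The second identity follows by expanding the KL divergence with the definition of $\pi_{Q^{\pi}}$. Applying the third with $\mu = \pi(\cdot \mid s)$ bounds the residual by a relative Fisher information term. This is the shape of a PL-type inequality, assuming $\rho$ can be taken uniform over states and iterates.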