🤖 AI Summary
This work addresses the lack of a unified theoretical framework for understanding the convergence and optimization mechanisms of natural policy gradient (NPG) algorithms. We propose the Doubly Smoothed Policy Iteration (DSPI) framework, which, for the first time, embeds NPG within a policy iteration scheme viewed through smoothed Bellman operators, interpreting it as a regularized greedy update applied to a weighted average of historical $Q$-functions. Without modifying the underlying MDP or introducing explicit regularization, the framework establishes global geometric convergence of NPG, and the same analysis extends to linear function approximation and stochastic shortest path problems. Leveraging the monotonicity and contraction properties of smoothed Bellman operators, together with mirror mapping and dual averaging techniques, we develop a unified analysis proving that both standard NPG and policy dual averaging achieve an iteration complexity of $O((1-\gamma)^{-1} \log((1-\gamma)^{-1}\varepsilon^{-1}))$ to obtain an $\varepsilon$-optimal policy. For the unregularized greedy case (dual-averaged policy iteration), the analysis also guarantees finite termination.
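To give a sense of scale for the iteration bound, the following evaluates it at illustrative values $\gamma = 0.99$ and $\varepsilon = 10^{-2}$. These values are assumptions chosen for the example, not figures from the paper, and the constant hidden by the $O(\cdot)$ notation is ignored. The logarithm is taken to be natural.

```latex
\[
\frac{1}{1-\gamma}\,\log\!\left(\frac{1}{(1-\gamma)\,\varepsilon}\right)
= 100 \cdot \log\!\left(10^{4}\right)
\approx 100 \times 9.21
= 921 .
\]
```

Under these assumptions the bound is on the order of a thousand iterations, up to the unspecified constant. Tightening $\varepsilon$ by a factor of 10 adds only about $100 \times \log 10 \approx 230$ to this figure, since the dependence on $\varepsilon$ is logarithmic.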
📝 Abstract
In this work, we show that natural policy gradient, a core algorithm in reinforcement learning, admits an exact formulation as a smoothed and averaged form of policy iteration. Specifically, we introduce doubly smoothed policy iteration (DSPI), a Bellman-operator framework in which each policy is obtained by applying a regularized greedy step to a weighted average of past $Q$-functions. DSPI includes policy iteration, dual-averaged policy iteration, natural policy gradient, and more general policy dual averaging methods as special cases. Using only monotonicity and contraction of smoothed Bellman operators, we prove distribution-free global geometric convergence of DSPI. Consequently, standard natural policy gradient and policy dual averaging achieve an iteration complexity of $\mathcal{O}((1-\gamma)^{-1}\log((1-\gamma)^{-1}\varepsilon^{-1}))$ for computing an $\varepsilon$-optimal policy, without modifying the MDP, adding regularization beyond the mirror map inherent in the update, or using adaptive, trajectory-dependent stepsizes. For the unregularized greedy case, corresponding to dual-averaged policy iteration, we also prove finite termination. The same Bellman-operator framework further extends to discounted MDPs with linear function approximation and stochastic shortest path problems.
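The "regularized greedy step on an average of past $Q$-functions" reading can be checked numerically in the simplest setting. The sketch below is an illustration under stated assumptions, not the paper's DSPI algorithm: it uses a small random tabular MDP, exact policy evaluation, a uniform initial policy, a constant stepsize `eta`, the entropy mirror map (softmax), and uniform averaging weights. All names (`q_values`, `run_npg`, `eta`) are hypothetical. It runs the usual multiplicative NPG update $\pi_{k+1}(a|s) \propto \pi_k(a|s)\exp(\eta\, Q^{\pi_k}(s,a))$ and compares it with a softmax applied to the running average of past $Q$-functions.

```python
import numpy as np

def q_values(P, r, gamma, pi):
    """Exact Q^pi for a tabular MDP. P: (S, A, S), r: (S, A), pi: (S, A)."""
    S = r.shape[0]
    P_pi = np.einsum("sa,sat->st", pi, P)   # state-to-state kernel under pi
    r_pi = np.sum(pi * r, axis=1)           # expected one-step reward under pi
    v = np.linalg.solve(np.eye(S) - gamma * P_pi, r_pi)
    return r + gamma * P @ v                # shape (S, A)

def softmax(x):
    """Row-wise softmax with the usual max-shift for numerical stability."""
    z = np.exp(x - x.max(axis=1, keepdims=True))
    return z / z.sum(axis=1, keepdims=True)

def run_npg(P, r, gamma, eta, num_iters):
    """Run softmax NPG and track its gap to the averaged-Q softmax form."""
    S, A = r.shape
    pi = np.full((S, A), 1.0 / A)           # assumption: uniform initial policy
    q_sum = np.zeros((S, A))
    max_gap = 0.0
    for k in range(num_iters):
        q = q_values(P, r, gamma, pi)
        q_sum += q
        # Standard NPG (multiplicative-weights) update.
        pi_npg = pi * np.exp(eta * q)
        pi_npg /= pi_npg.sum(axis=1, keepdims=True)
        # Same policy, written as a softmax (entropy-regularized greedy step)
        # on the uniform average of past Q-functions, with the regularization
        # weight shrinking like 1 / (eta * (k + 1)).
        q_avg = q_sum / (k + 1)
        pi_avg = softmax(eta * (k + 1) * q_avg)
        max_gap = max(max_gap, float(np.abs(pi_npg - pi_avg).max()))
        pi = pi_npg
    return pi, max_gap

rng = np.random.default_rng(0)
S, A = 4, 3
P = rng.dirichlet(np.ones(S), size=(S, A))  # random transition kernel, (S, A, S)
r = rng.uniform(size=(S, A))                # random rewards in [0, 1)
pi_final, max_gap = run_npg(P, r, gamma=0.9, eta=1.0, num_iters=30)
print("largest difference between the two forms:", max_gap)
```

The two forms agree up to floating-point error because, starting from a uniform policy, unrolling the multiplicative update gives $\pi_{k+1}(a|s) \propto \exp\big(\eta \sum_{j \le k} Q^{\pi_j}(s,a)\big)$. Other weightings or mirror maps covered by the framework would change the averaging and the greedy step accordingly.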