AI Summary
This work investigates the impact of entropy regularization on the convergence of policy gradient methods in stochastic exit-time control. We propose a continuous-time policy mirror descent dynamics, where entropy regularization strength is gradually annealed to enable a smooth transition from regularized solutions to the unregularized optimal policy. We establish, for the first time, a convergence rate theory for entropy-annealed mirror descent in the infinite-dimensional space of Markov kernels, revealing how entropy regularization fundamentally accelerates true gradient optimization. Theoretically, we prove exponential convergence under fixed entropy; under polynomial entropy decay, the method achieves $O(1/S)$ convergence in discrete action spaces and $O(1/\sqrt{S})$ in general (continuous) action spaces. Our key innovation lies in integrating entropy annealing with infinite-dimensional variational optimization, yielding the first mirror descent framework for non-convex stochastic control problems with explicit, provable convergence rates.
Abstract
Entropy regularization has been widely used in policy optimization algorithms to enhance exploration and the robustness of the optimal control; however, it also introduces an additional regularization bias. This work quantifies the impact of entropy regularization on the convergence of policy gradient methods for stochastic exit-time control problems. We analyze a continuous-time policy mirror descent dynamics, which updates the policy based on the gradient of an entropy-regularized value function and adjusts the strength of entropy regularization as the algorithm progresses. We prove that with a fixed entropy level, the mirror descent dynamics converges exponentially to the optimal solution of the regularized problem. We further show that when the entropy level decays at suitable polynomial rates, the annealed flow converges to the solution of the unregularized problem at a rate of $\mathcal{O}(1/S)$ for discrete action spaces and, under suitable conditions, at a rate of $\mathcal{O}(1/\sqrt{S})$ for general action spaces, with $S$ being the gradient flow running time. The technical challenge lies in analyzing the gradient flow in the infinite-dimensional space of Markov kernels for nonconvex objectives. This paper explains how entropy regularization improves policy optimization, even with the true gradient, from the perspective of convergence rate.
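As a rough illustration of the fixed-entropy versus annealed-entropy behaviour described above, the following minimal sketch runs an Euler discretization of a mirror descent flow on a toy one-state problem with three actions. It is not the exit-time control setting of the paper. The logit dynamics $\mathrm{d}Z_s/\mathrm{d}s = -(c + \tau_s Z_s)$ with $\pi_s \propto \exp(Z_s)$, the cost vector `cost`, the uniform reference measure, the step size, and the schedule $\tau_s = (1+s)^{-1/2}$ are all assumptions chosen for illustration.

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax over a 1-D array of logits.
    z = z - z.max()
    p = np.exp(z)
    return p / p.sum()

def mirror_descent_flow(cost, tau_schedule, horizon, step=0.01):
    """Euler scheme for the assumed logit flow dZ/ds = -(cost + tau(s) * Z).

    The policy at time s is softmax(Z_s); Z_0 = 0 corresponds to the
    uniform policy. This is a toy one-state surrogate, not the paper's
    exit-time control dynamics.
    """
    logits = np.zeros_like(cost, dtype=float)
    n_steps = int(round(horizon / step))
    for k in range(n_steps):
        tau = tau_schedule(k * step)
        logits += step * (-(cost + tau * logits))
    return softmax(logits)

# Hypothetical per-action costs; action 0 is optimal for the unregularized problem.
cost = np.array([0.0, 0.3, 1.0])

# Fixed entropy level: the flow settles at the Gibbs policy exp(-cost / tau).
tau_fixed = 0.5
pi_fixed = mirror_descent_flow(cost, lambda s: tau_fixed, horizon=50.0)
gibbs = softmax(-cost / tau_fixed)

# Annealed entropy level (illustrative polynomial schedule).
pi_annealed = mirror_descent_flow(
    cost, lambda s: 1.0 / np.sqrt(1.0 + s), horizon=50.0
)

# Suboptimality with respect to the unregularized objective.
gap_fixed = pi_fixed @ cost - cost.min()
gap_annealed = pi_annealed @ cost - cost.min()
print("fixed-tau gap:", gap_fixed)
print("annealed gap: ", gap_annealed)
```

In this toy setting the fixed-entropy flow converges to the regularized (Gibbs) optimum and therefore retains a regularization bias, while the annealed flow keeps moving toward the unregularized optimum as the running time grows.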