🤖 AI Summary
This work investigates the approximation error induced by entropy regularization in discounted infinite-horizon discrete Markov decision processes (MDPs). Addressing the conservatism of the classical $O(\tau)$ error bound, where $\tau$ is the regularization strength, the authors establish exponential upper and lower bounds that match up to a polynomial factor, proving that the error of the regularized optimal policy decays at the rate $e^{-c/\tau}$ with a problem-specific exponent $c$, a substantial improvement over linear bounds. Methodologically, the entropy-regularized solutions are shown to follow a gradient flow of the unregularized reward with respect to a Riemannian metric common in natural policy gradient methods, revealing the flow's implicit bias toward the generalized maximum-entropy optimal policy. Combining the sharpened, KL-weighted error estimates with this gradient-flow correspondence, the authors further show that the overall error of entropy-regularized natural policy gradient methods decays exponentially in the square root of the number of iterations, improving over existing sublinear guarantees. The results unify the geometric interpretation and convergence mechanism of entropy regularization and extend naturally to general convex regularizers.
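To make the headline rate concrete, below is a minimal numerical sketch (not code from the paper; the random MDP, the helper names `soft_value_iteration` and `policy_value`, and all constants are illustrative assumptions). It solves a small tabular MDP by soft value iteration and measures the value gap of the $\tau$-regularized optimal policy against the unregularized optimum as $\tau$ shrinks:

```python
import numpy as np

rng = np.random.default_rng(0)
S, A, gamma = 5, 3, 0.9
P = rng.dirichlet(np.ones(S), size=(S, A))   # P[s, a] = distribution over next states
r = rng.uniform(size=(S, A))                 # reward table r[s, a]

def soft_value_iteration(tau, iters=3000):
    """Entropy-regularized (soft) Bellman fixed point; returns the tau-optimal policy."""
    V = np.zeros(S)
    for _ in range(iters):
        Q = r + gamma * P @ V                # Q[s, a]; P @ V contracts the next-state axis
        m = Q.max(axis=1)                    # shift for a numerically stable soft-max backup
        V = m + tau * np.log(np.exp((Q - m[:, None]) / tau).sum(axis=1))
    pi = np.exp((Q - m[:, None]) / tau)      # regularized optimal policy is a softmax of Q/tau
    return pi / pi.sum(axis=1, keepdims=True)

def policy_value(pi):
    """Exact unregularized value of a stationary policy via a linear solve."""
    P_pi = np.einsum('sa,sap->sp', pi, P)
    r_pi = (pi * r).sum(axis=1)
    return np.linalg.solve(np.eye(S) - gamma * P_pi, r_pi)

# unregularized optimum via standard value iteration
V_star = np.zeros(S)
for _ in range(3000):
    V_star = (r + gamma * P @ V_star).max(axis=1)

for tau in [1.0, 0.5, 0.2, 0.1, 0.05]:
    gap = np.max(V_star - policy_value(soft_value_iteration(tau)))
    print(f"tau = {tau:5.2f}   value gap = {gap:.3e}")
```

On instances where the optimal action is unique, the printed gap collapses far faster than the linear $O(\tau)$ bound would suggest, in line with the $e^{-c/\tau}$ rate.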
📝 Abstract
We study the error introduced by entropy regularization in infinite-horizon, discrete, discounted Markov decision processes. We show that this error decreases exponentially in the inverse regularization strength, both in a weighted KL divergence and in value, with a problem-specific exponent. This is in contrast to previously known estimates of order $O(\tau)$, where $\tau$ is the regularization strength. We provide a lower bound matching our upper bound up to a polynomial term, thereby characterizing the exponential convergence rate for entropy regularization. Our proof relies on the observation that the solutions of entropy-regularized Markov decision processes solve a gradient flow of the unregularized reward with respect to a Riemannian metric common in natural policy gradient methods. This correspondence allows us to identify the limit of the gradient flow, which corresponds to a time-continuous version of the natural policy gradient method, as the generalized maximum-entropy optimal policy, thereby characterizing its implicit bias. We use our improved error estimates to show that for entropy-regularized natural policy gradient methods, the overall error decays exponentially in the square root of the number of iterations, improving over existing sublinear guarantees. Finally, we extend our analysis to settings beyond the entropy. In particular, we characterize the implicit bias with respect to general convex potentials and the resulting generalized natural policy gradients.
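For readers who want to see the algorithmic object the iteration-complexity result refers to, here is a minimal tabular sketch of an entropy-regularized natural policy gradient step, written in the closed multiplicative-weights form it takes for softmax policies (a hedged illustration, not the paper's implementation; `soft_q`, `alpha`, and the random MDP are assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)
S, A, gamma, tau = 4, 3, 0.9, 0.1
alpha = 0.5                                  # effective step size in (0, 1]
P = rng.dirichlet(np.ones(S), size=(S, A))   # transition kernel P[s, a]
r = rng.uniform(size=(S, A))                 # rewards

def soft_q(pi, iters=1000):
    """Entropy-regularized Q-function of pi by fixed-point iteration."""
    Q = np.zeros((S, A))
    for _ in range(iters):
        V = (pi * (Q - tau * np.log(pi))).sum(axis=1)   # soft value: E_pi[Q] + tau * entropy
        Q = r + gamma * P @ V
    return Q

pi = np.full((S, A), 1.0 / A)                # uniform initialization
for _ in range(100):
    Q = soft_q(pi)
    # closed-form regularized NPG step: pi_new ∝ pi^(1 - alpha) * exp(alpha * Q / tau)
    logits = (1.0 - alpha) * np.log(pi) + alpha * Q / tau
    pi = np.exp(logits - logits.max(axis=1, keepdims=True))
    pi /= pi.sum(axis=1, keepdims=True)

print(pi.round(3))                           # approaches the tau-regularized optimal policy
```

In the usual parameterization, the effective step size `alpha` plays the role of $\eta\tau/(1-\gamma)$, and `alpha = 1` recovers soft policy iteration. Roughly, balancing the linear convergence of such iterations toward the $\tau$-regularized optimum against the $e^{-c/\tau}$ regularization error, with $\tau$ chosen as a function of the iteration budget, suggests the exponential-in-$\sqrt{k}$ overall rate stated in the abstract.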