🤖 AI Summary
This work investigates the approximation error induced by entropy regularization in discounted infinite-horizon discrete Markov decision processes (MDPs). Addressing the conservatism of the classical $O(\tau)$ error bound, where $\tau$ is the regularization strength, the authors establish exponential upper and lower bounds that match up to a polynomial factor, proving that the error of the regularized optimal policy decays at the rate $e^{-c/\tau}$ with a problem-specific exponent $c$, a substantial improvement over linear bounds. Methodologically, the entropy-regularized solutions are shown to follow a gradient flow of the unregularized reward with respect to a Riemannian metric common in natural policy gradient methods, revealing the flow's implicit bias toward the generalized maximum-entropy optimal policy. Combining the sharpened, KL-weighted error estimates with this gradient-flow correspondence, the authors further show that the overall error of entropy-regularized natural policy gradient methods decays exponentially in the square root of the number of iterations, improving over existing sublinear guarantees. The results unify the geometric interpretation and convergence mechanism of entropy regularization and extend naturally to general convex regularizers.
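To make the headline rate concrete, below is a minimal numerical sketch (not code from the paper; the random MDP, the helper names `soft_value_iteration` and `policy_value`, and all constants are illustrative assumptions). It solves a small tabular MDP by soft value iteration and measures the value gap of the $\tau$-regularized optimal policy against the unregularized optimum as $\tau$ shrinks:

```python
import numpy as np

rng = np.random.default_rng(0)
S, A, gamma = 5, 3, 0.9
P = rng.dirichlet(np.ones(S), size=(S, A))   # P[s, a] = distribution over next states
r = rng.uniform(size=(S, A))                 # reward table r[s, a]

def soft_value_iteration(tau, iters=3000):
    """Entropy-regularized (soft) Bellman fixed point; returns the tau-optimal policy."""
    V = np.zeros(S)
    for _ in range(iters):
        Q = r + gamma * P @ V                # Q[s, a]; P @ V contracts the next-state axis
        m = Q.max(axis=1)                    # shift for a numerically stable soft-max backup
        V = m + tau * np.log(np.exp((Q - m[:, None]) / tau).sum(axis=1))
    pi = np.exp((Q - m[:, None]) / tau)      # regularized optimal policy is a softmax of Q/tau
    return pi / pi.sum(axis=1, keepdims=True)

def policy_value(pi):
    """Exact unregularized value of a stationary policy via a linear solve."""
    P_pi = np.einsum('sa,sap->sp', pi, P)
    r_pi = (pi * r).sum(axis=1)
    return np.linalg.solve(np.eye(S) - gamma * P_pi, r_pi)

# unregularized optimum via standard value iteration
V_star = np.zeros(S)
for _ in range(3000):
    V_star = (r + gamma * P @ V_star).max(axis=1)

for tau in [1.0, 0.5, 0.2, 0.1, 0.05]:
    gap = np.max(V_star - policy_value(soft_value_iteration(tau)))
    print(f"tau = {tau:5.2f}   value gap = {gap:.3e}")
```

On instances where the optimal action is unique, the printed gap collapses far faster than the linear $O(\tau)$ bound would suggest, in line with the $e^{-c/\tau}$ rate.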
📝 Abstract
We study the error introduced by entropy regularization in infinite-horizon, discrete, discounted Markov decision processes. We show that this error decreases exponentially in the inverse regularization strength, both in a weighted KL divergence and in value, with a problem-specific exponent. This is in contrast to previously known estimates of order $O(\tau)$, where $\tau$ is the regularization strength. We provide a lower bound matching our upper bound up to a polynomial term, thereby characterizing the exponential convergence rate for entropy regularization. Our proof relies on the observation that the solutions of entropy-regularized Markov decision processes solve a gradient flow of the unregularized reward with respect to a Riemannian metric common in natural policy gradient methods. This correspondence allows us to identify the limit of the gradient flow, which corresponds to a time-continuous version of the natural policy gradient method, as the generalized maximum-entropy optimal policy, thereby characterizing its implicit bias. We use our improved error estimates to show that for entropy-regularized natural policy gradient methods, the overall error decays exponentially in the square root of the number of iterations, improving over existing sublinear guarantees. Finally, we extend our analysis to settings beyond the entropy. In particular, we characterize the implicit bias with respect to general convex potentials and the resulting generalized natural policy gradients.
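For readers who want to see the algorithmic object the iteration-complexity result refers to, here is a minimal tabular sketch of an entropy-regularized natural policy gradient step, written in the closed multiplicative-weights form it takes for softmax policies (a hedged illustration, not the paper's implementation; `soft_q`, `alpha`, and the random MDP are assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)
S, A, gamma, tau = 4, 3, 0.9, 0.1
alpha = 0.5                                  # effective step size in (0, 1]
P = rng.dirichlet(np.ones(S), size=(S, A))   # transition kernel P[s, a]
r = rng.uniform(size=(S, A))                 # rewards

def soft_q(pi, iters=1000):
    """Entropy-regularized Q-function of pi by fixed-point iteration."""
    Q = np.zeros((S, A))
    for _ in range(iters):
        V = (pi * (Q - tau * np.log(pi))).sum(axis=1)   # soft value: E_pi[Q] + tau * entropy
        Q = r + gamma * P @ V
    return Q

pi = np.full((S, A), 1.0 / A)                # uniform initialization
for _ in range(100):
    Q = soft_q(pi)
    # closed-form regularized NPG step: pi_new ∝ pi^(1 - alpha) * exp(alpha * Q / tau)
    logits = (1.0 - alpha) * np.log(pi) + alpha * Q / tau
    pi = np.exp(logits - logits.max(axis=1, keepdims=True))
    pi /= pi.sum(axis=1, keepdims=True)

print(pi.round(3))                           # approaches the tau-regularized optimal policy
```

In the usual parameterization, the effective step size `alpha` plays the role of $\eta\tau/(1-\gamma)$, and `alpha = 1` recovers soft policy iteration. Roughly, balancing the linear convergence of such iterations toward the $\tau$-regularized optimum against the $e^{-c/\tau}$ regularization error, with $\tau$ chosen as a function of the iteration budget, suggests the exponential-in-$\sqrt{k}$ overall rate stated in the abstract.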