🤖 AI Summary
Traditional policy optimization in reinforcement learning maximizes only the expected return, yielding converged policies that are hard to interpret and lack behavioral diversity. To address this, we propose a temperature-decoupled, entropy-regularized framework that guarantees simultaneous convergence of the policy, value function, and return distribution while producing an interpretable, structurally explicit optimal policy: as the regularization temperature is annealed to zero, the policy converges to a deterministic-stochastic hybrid that is deterministic when the optimal action is unique and otherwise assigns uniform probability to all optimal actions. Leveraging distributional RL theory and a dynamic-programming analysis, we establish a joint convergence theorem for the policy, Q-function, and return distribution, and give a method that estimates the optimal return distribution to arbitrary precision. The key innovation is the temperature-decoupling mechanism, which rigorously reconciles exploration with convergence, achieving for the first time provable interpretability and sustained action diversity in entropy-regularized RL.
📝 Abstract
In the pursuit of an optimal policy, reinforcement learning (RL) methods generally ignore the properties of learned policies beyond their expected return. Thus, even when they succeed, it is difficult to characterize which policies will be learned and what those policies will do. In this work, we present a theoretical framework for policy optimization that guarantees convergence to a particular optimal policy via vanishing entropy regularization and a temperature decoupling gambit. Our approach realizes an interpretable, diversity-preserving optimal policy as the regularization temperature vanishes, and it ensures the convergence of policy-derived objects: value functions and return distributions. In a particular instance of our method, for example, the realized policy samples all optimal actions uniformly. Leveraging the temperature decoupling gambit, we present an algorithm that estimates, to arbitrary accuracy, the return distribution associated with this interpretable, diversity-preserving optimal policy.
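The limiting behavior described above can be illustrated with a minimal sketch (not code from the paper): a softmax policy over Q-values with temperature `tau`, where `pi_tau(a) ∝ exp(Q(a)/tau)`. The Q-values below are hypothetical, chosen so that two actions tie for optimal; as `tau` shrinks toward zero, the policy's probability mass concentrates uniformly on the tied optimal actions.

```python
import numpy as np

def softmax_policy(q, tau):
    """Entropy-regularized (softmax) policy: pi_tau(a) proportional to exp(Q(a)/tau)."""
    z = (q - q.max()) / tau          # subtract the max for numerical stability
    p = np.exp(z)
    return p / p.sum()

# Hypothetical Q-values: actions 1 and 3 tie for optimal.
q = np.array([1.0, 2.0, 0.5, 2.0])

for tau in [1.0, 0.1, 0.001]:
    print(f"tau={tau:>6}: {softmax_policy(q, tau).round(3)}")
# As tau -> 0, the policy approaches [0.0, 0.5, 0.0, 0.5]:
# uniform over the set of optimal actions, zero elsewhere.
```

This is only the vanishing-temperature intuition in its simplest form; the paper's contribution is the temperature decoupling gambit that makes this limit well behaved jointly for the policy, value function, and return distribution.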