🤖 AI Summary
Traditional policy optimization in reinforcement learning maximizes only the expected return, yielding converged policies that are hard to interpret and lack behavioral diversity. To address this, we propose a temperature-decoupled, entropy-regularized framework that guarantees simultaneous convergence of the policy, value function, and return distribution while producing an interpretable, structurally explicit optimal policy: as the regularization temperature is annealed to zero, the policy converges to a deterministic-stochastic hybrid that is deterministic when the optimal action is unique and otherwise assigns uniform probability to all optimal actions. Leveraging distributional RL theory and a dynamic-programming analysis, we establish a joint convergence theorem for the policy, Q-function, and return distribution, and give a method that estimates the optimal return distribution to arbitrary precision. The key innovation is the temperature-decoupling mechanism, which rigorously reconciles exploration with convergence, achieving for the first time provable interpretability and sustained action diversity in entropy-regularized RL.
📝 Abstract
In the pursuit of an optimal policy, reinforcement learning (RL) methods generally ignore the properties of learned policies beyond their expected return. Thus, even when they succeed, it is difficult to characterize which policies will be learned and what those policies will do. In this work, we present a theoretical framework for policy optimization that guarantees convergence to a particular optimal policy via vanishing entropy regularization and a temperature decoupling gambit. Our approach realizes an interpretable, diversity-preserving optimal policy as the regularization temperature vanishes, and it ensures the convergence of policy-derived objects: value functions and return distributions. In a particular instance of our method, for example, the realized policy samples all optimal actions uniformly. Leveraging the temperature decoupling gambit, we present an algorithm that estimates, to arbitrary accuracy, the return distribution associated with this interpretable, diversity-preserving optimal policy.
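The limiting behavior described above can be illustrated with a minimal sketch (not code from the paper): a softmax policy over Q-values with temperature `tau`, where `pi_tau(a) ∝ exp(Q(a)/tau)`. The Q-values below are hypothetical, chosen so that two actions tie for optimal; as `tau` shrinks toward zero, the policy's probability mass concentrates uniformly on the tied optimal actions.

```python
import numpy as np

def softmax_policy(q, tau):
    """Entropy-regularized (softmax) policy: pi_tau(a) proportional to exp(Q(a)/tau)."""
    z = (q - q.max()) / tau          # subtract the max for numerical stability
    p = np.exp(z)
    return p / p.sum()

# Hypothetical Q-values: actions 1 and 3 tie for optimal.
q = np.array([1.0, 2.0, 0.5, 2.0])

for tau in [1.0, 0.1, 0.001]:
    print(f"tau={tau:>6}: {softmax_policy(q, tau).round(3)}")
# As tau -> 0, the policy approaches [0.0, 0.5, 0.0, 0.5]:
# uniform over the set of optimal actions, zero elsewhere.
```

This is only the vanishing-temperature intuition in its simplest form; the paper's contribution is the temperature decoupling gambit that makes this limit well behaved jointly for the policy, value function, and return distribution.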