🤖 AI Summary
Mirror Descent Value Iteration (MDVI) underperforms entropy-regularized methods in continuous action spaces. Method: We propose Mirror Descent Actor-Critic (MDAC), the first systematic integration of MDVI into the Actor-Critic framework. MDAC stabilizes policy optimization by explicitly bounding the actor's log-density terms in the critic loss. Contributions/Results: Theoretically, we justify bounded advantage learning by recalling that, in the tabular case, the actor's log-probability equals the regularized advantage function. Technically, MDAC combines KL-divergence and entropy regularization, advantage truncation, and continuous-policy gradient estimation. Empirically, MDAC significantly outperforms both unregularized and entropy-only baselines on standard continuous control benchmarks, demonstrating superior effectiveness and robustness.
📄 Abstract
Regularization is a core component of recent Reinforcement Learning (RL) algorithms. Mirror Descent Value Iteration (MDVI) uses both Kullback-Leibler divergence and entropy as regularizers in its value and policy updates. Despite its empirical success in discrete action domains and its strong theoretical guarantees, the performance of an MDVI-based method does not surpass that of an entropy-only-regularized method in continuous action domains. In this study, we propose Mirror Descent Actor Critic (MDAC) as an actor-critic style instantiation of MDVI for continuous action domains, and show that its empirical performance is significantly boosted by bounding the actor's log-density terms in the critic's loss function, compared to a non-bounded naive instantiation. Further, we relate MDAC to Advantage Learning by recalling that the actor's log-probability equals the regularized advantage function in tabular cases, and we theoretically discuss when and why bounding the advantage terms is valid and beneficial. We also empirically explore good choices for the bounding function, and show that MDAC performs better than strong non-regularized and entropy-only-regularized methods with an appropriate choice of the bounding function.
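To make the key idea concrete, the following is a minimal NumPy sketch of a Munchausen-style critic target with both an entropy term and a KL-derived log-density term, where a bounding function is applied to the log-densities. The function names (`bounded`, `critic_target`), the tanh-based bounding choice, and the coefficient names `tau`/`kappa` are illustrative assumptions, not the paper's exact formulation or hyperparameters.

```python
import numpy as np

def bounded(x, scale=1.0):
    # Hypothetical bounding function b(x) = scale * tanh(x / scale).
    # A squashing function like this keeps extreme log-densities from
    # dominating the target; the paper compares several such choices.
    return scale * np.tanh(x / scale)

def critic_target(r, gamma, q_next, logp_next, logp_curr,
                  tau=0.2, kappa=0.9, bound=True):
    # Sketch of a KL+entropy-regularized (Munchausen-style) critic target:
    #   y = r + kappa * b(log pi(a|s)) + gamma * (Q(s',a') - tau * b(log pi(a'|s')))
    # With bound=False this is the naive (unbounded) instantiation, which
    # can diverge when densities of a continuous policy become extreme.
    f = bounded if bound else (lambda x: x)
    return r + kappa * f(logp_curr) + gamma * (q_next - tau * f(logp_next))

# Unbounded vs. bounded target for an extreme log-density value:
y_naive = critic_target(1.0, 0.99, 0.5, -50.0, -50.0, bound=False)
y_safe = critic_target(1.0, 0.99, 0.5, -50.0, -50.0, bound=True)
```

With `bound=True`, each log-density contribution is confined to `[-scale, scale]`, so the target stays on the scale of the rewards even when the policy assigns near-zero density to an action.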