Revisiting Mixture Policies in Entropy-Regularized Actor-Critic

📅 2026-05-09

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Mixture policies are theoretically more expressive than unimodal policies in continuous-action reinforcement learning, but they have seldom been adopted because they lack a low-variance gradient estimator. This work proposes the marginalized reparameterization (MRP) estimator, which provides low-variance gradient estimates for mixture policies within an entropy-regularized actor-critic framework. The authors prove that MRP has lower gradient variance than the standard likelihood-ratio (LR) estimator. Experiments on Gym MuJoCo, DeepMind Control Suite, and MetaWorld show that MRP mixture policies significantly outperform their LR counterparts and match, and in some cases exceed, Gaussian policies.

📝 Abstract

Mixture policies theoretically offer greater flexibility than unimodal policies in continuous-action reinforcement learning, but the practical benefits of this complexity remain elusive. Mixture policies are notably absent from most state-of-the-art algorithms, raising a fundamental question: Is the added representational overhead useful? We show that increased flexibility can theoretically enhance solution quality and entropy robustness. Yet standard algorithms like SAC do not leverage these advantages. A core issue is the lack of a low-variance reparameterization trick for mixtures, a luxury Gaussian policies enjoy. We propose a marginalized reparameterization (MRP) estimator to address this, proving it offers lower variance than the standard likelihood-ratio (LR) approach. Our experiments across Gym MuJoCo, DeepMind Control Suite, and MetaWorld show that MRP mixture policies significantly outperform their LR counterparts and reach parity with, and sometimes exceed, their Gaussian counterparts. In addition, we find several cases where MRP mixture policies exhibit clear empirical advantages. In this paper, we provide a clearer understanding of the trade-offs involved, elevating MRP mixture policies from theoretical curiosity to a practical tool.
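
To make the variance gap between the LR and reparameterized estimators concrete, here is a minimal NumPy sketch on a 1-D Gaussian mixture. It is one plausible reading of "marginalized reparameterization" (sum over the discrete component analytically, then reparameterize each Gaussian component), not necessarily the estimator defined in the paper. The quadratic `critic`, the function names, and the mixture parameters are all hypothetical, and the entropy term is omitted for brevity.

```python
import numpy as np

def critic(a):
    # Hypothetical stand-in for a learned Q-value (assumption); peaks at a = 1.
    return -(a - 1.0) ** 2

def critic_grad(a):
    # Derivative of the toy critic with respect to the action.
    return -2.0 * (a - 1.0)

def lr_grad_mu(rng, w, mu, sigma, n):
    """Likelihood-ratio estimates of d/d mu_j E[critic(a)]; shape (n, K)."""
    k = rng.choice(len(w), size=n, p=w)              # sample a component
    a = mu[k] + sigma[k] * rng.standard_normal(n)    # sample an action
    z = (a[:, None] - mu) / sigma                    # standardized residuals
    dens = w * np.exp(-0.5 * z**2) / (sigma * np.sqrt(2.0 * np.pi))
    resp = dens / dens.sum(axis=1, keepdims=True)    # posterior over components
    score = resp * z / sigma                         # d log pi(a) / d mu_j
    return critic(a)[:, None] * score

def marginalized_rp_grad_mu(rng, w, mu, sigma, n):
    """Sum over components exactly, reparameterize each one; shape (n, K)."""
    eps = rng.standard_normal((n, 1))
    a = mu + sigma * eps                             # one action per component
    return w * critic_grad(a)                        # d/d mu_j of w_j * critic(a_j)

# Illustrative mixture parameters (assumptions, not from the paper).
w = np.array([0.3, 0.7])
mu = np.array([-1.0, 2.0])
sigma = np.array([0.5, 0.5])

rng = np.random.default_rng(0)
lr = lr_grad_mu(rng, w, mu, sigma, 200_000)
rp = marginalized_rp_grad_mu(rng, w, mu, sigma, 200_000)
exact = -2.0 * w * (mu - 1.0)                        # closed form for this critic

print("exact gradient:", exact)
print("LR  mean / var:", lr.mean(axis=0), lr.var(axis=0))
print("MRP mean / var:", rp.mean(axis=0), rp.var(axis=0))
```

Both estimators are unbiased for the same gradient, which for this toy critic is `[1.2, -1.4]`, so their sample means should agree. The per-sample variance of the LR estimates should come out far larger, because each LR sample multiplies the raw critic value by a score term, whereas the reparameterized version only uses the critic's local slope.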

Problem

Research questions and friction points this paper is trying to address.

mixture policies

continuous action reinforcement learning

representational overhead

entropy regularization

policy flexibility

Innovation

Methods, ideas, or system contributions that make the work stand out.

mixture policies

reparameterization

low-variance gradient estimation