🤖 AI Summary
Direct integration of Soft Actor-Critic (SAC) with n-step returns introduces off-policy bias due to policy drift, while conventional importance sampling suffers from numerical instability and high variance. This paper proposes SACn, the first method enabling safe and stable fusion of SAC with n-step entropy-regularized reinforcement learning. Its core innovations are: (1) τ-sampled entropy estimation, which reduces variance in the target Q-function by decoupling entropy estimation from policy evaluation; and (2) a simplified importance sampling mechanism that eliminates high-variance weight accumulation and alleviates hyperparameter sensitivity. Evaluated on the MuJoCo benchmark, SACn achieves significantly faster convergence and improved policy stability compared to standard SAC and multiple n-step baselines. It consistently outperforms these methods across diverse tasks, establishing a robust and efficient framework for off-policy n-step maximum-entropy RL.
📝 Abstract
Soft Actor-Critic (SAC) is widely used in practical applications and is now one of the most relevant off-policy online model-free reinforcement learning (RL) methods. The technique of n-step returns is known to increase the convergence speed of RL algorithms compared to their 1-step counterparts. However, SAC is notoriously difficult to combine with n-step returns, since the naive combination introduces bias into off-policy algorithms due to changes in the action distribution. This bias can in principle be corrected by importance sampling, a method for estimating expected values under one distribution using samples from another, but importance sampling often causes numerical instability. In this work, we combine SAC with n-step returns in a way that overcomes this issue. We present an approach to applying numerically stable importance sampling with simplified hyperparameter selection. Furthermore, we analyze the entropy estimation approach of Soft Actor-Critic in the context of the n-step maximum-entropy framework and formulate $\tau$-sampled entropy estimation to reduce the variance of the learning target. Finally, we formulate the Soft Actor-Critic with n-step returns (SAC$n$) algorithm, which we experimentally verify on MuJoCo simulated environments.
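To make the n-step maximum-entropy target the abstract refers to concrete, here is a minimal sketch of the generic n-step soft return (discounted rewards plus entropy bonuses, bootstrapped with a soft Q-value). The function name, the uniform entropy-bonus placement, and the scalar inputs are illustrative assumptions for exposition; this is the standard textbook form, not the paper's SAC$n$ update or its $\tau$-sampled estimator.

```python
def n_step_soft_target(rewards, entropies, bootstrap_q, gamma=0.99, alpha=0.2):
    """Generic n-step maximum-entropy return target (illustrative sketch).

    rewards:     [r_t, ..., r_{t+n-1}] collected along the sampled trajectory
    entropies:   per-step policy entropy estimates H_t, ..., H_{t+n-1}
    bootstrap_q: soft Q-value estimate at the n-th successor state
    """
    n = len(rewards)
    target = 0.0
    # Accumulate sum_{k=0}^{n-1} gamma^k * (r_{t+k} + alpha * H_{t+k})
    for k in range(n):
        target += gamma**k * (rewards[k] + alpha * entropies[k])
    # Bootstrap with the discounted soft Q-value at step t+n
    target += gamma**n * bootstrap_q
    return target
```

With n = 1 this reduces to the ordinary SAC target; for n > 1 and off-policy data, the action distribution may have drifted, which is exactly the bias the paper's importance-sampling scheme addresses.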