🤖 AI Summary
Direct integration of Soft Actor-Critic (SAC) with n-step returns introduces off-policy bias due to policy drift, while conventional importance sampling suffers from numerical instability and high variance. This paper proposes SACn, the first method enabling safe and stable fusion of SAC with n-step entropy-regularized reinforcement learning. Its core innovations are: (1) τ-sampled entropy estimation, which reduces variance in the target Q-function by decoupling entropy estimation from policy evaluation; and (2) a simplified importance sampling mechanism that eliminates high-variance weight accumulation and alleviates hyperparameter sensitivity. Evaluated on the MuJoCo benchmark, SACn achieves significantly faster convergence and improved policy stability compared to standard SAC and multiple n-step baselines. It consistently outperforms these methods across diverse tasks, establishing a robust and efficient framework for off-policy n-step maximum-entropy RL.
📝 Abstract
Soft Actor-Critic (SAC) is widely used in practical applications and is now one of the most relevant off-policy online model-free reinforcement learning (RL) methods. The technique of n-step returns is known to increase the convergence speed of RL algorithms compared to their 1-step counterparts. However, SAC is notoriously difficult to combine with n-step returns, since the naive combination introduces bias into off-policy algorithms due to changes in the action distribution. This bias can in principle be corrected by importance sampling, a method for estimating expected values under one distribution using samples from another, but importance sampling often causes numerical instability. In this work, we combine SAC with n-step returns in a way that overcomes this issue. We present an approach to applying numerically stable importance sampling with simplified hyperparameter selection. Furthermore, we analyze the entropy estimation approach of Soft Actor-Critic in the context of the n-step maximum-entropy framework and formulate $\tau$-sampled entropy estimation to reduce the variance of the learning target. Finally, we formulate the Soft Actor-Critic with n-step returns (SAC$n$) algorithm, which we experimentally verify on MuJoCo simulated environments.
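To make the n-step maximum-entropy target the abstract refers to concrete, here is a minimal sketch of the generic n-step soft return (discounted rewards plus entropy bonuses, bootstrapped with a soft Q-value). The function name, the uniform entropy-bonus placement, and the scalar inputs are illustrative assumptions for exposition; this is the standard textbook form, not the paper's SAC$n$ update or its $\tau$-sampled estimator.

```python
def n_step_soft_target(rewards, entropies, bootstrap_q, gamma=0.99, alpha=0.2):
    """Generic n-step maximum-entropy return target (illustrative sketch).

    rewards:     [r_t, ..., r_{t+n-1}] collected along the sampled trajectory
    entropies:   per-step policy entropy estimates H_t, ..., H_{t+n-1}
    bootstrap_q: soft Q-value estimate at the n-th successor state
    """
    n = len(rewards)
    target = 0.0
    # Accumulate sum_{k=0}^{n-1} gamma^k * (r_{t+k} + alpha * H_{t+k})
    for k in range(n):
        target += gamma**k * (rewards[k] + alpha * entropies[k])
    # Bootstrap with the discounted soft Q-value at step t+n
    target += gamma**n * bootstrap_q
    return target
```

With n = 1 this reduces to the ordinary SAC target; for n > 1 and off-policy data, the action distribution may have drifted, which is exactly the bias the paper's importance-sampling scheme addresses.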