🤖 AI Summary
In discrete-action off-policy reinforcement learning, actor-critic methods (e.g., discrete SAC) underperform value-based approaches (e.g., DQN), primarily because of the coupling between the actor and critic entropy.
Method: The paper identifies this coupling as the key bottleneck and proposes decoupling the two entropy terms so that the actor and critic are regularized separately. Building on this, it introduces a general off-policy actor-critic framework that subsumes discrete SAC as a special case, supports *m*-step Bellman updates for the critic, and combines standard policy optimization methods with entropy regularization in the actor objective. The proposed methods are proven to converge to the optimal regularized value function in the tabular setting.
Contribution/Results: On standard Atari games, the proposed methods approach the performance of DQN, even without entropy regularization or explicit exploration.
📝 Abstract
Value-based approaches such as DQN are the default methods for off-policy reinforcement learning in discrete-action environments such as Atari. Common policy-based methods are either on-policy and do not learn effectively from off-policy data (e.g., PPO), or perform poorly in the discrete-action setting (e.g., SAC). Consequently, starting from discrete SAC (DSAC), we revisit the design of actor-critic methods in this setting. First, we determine that the coupling between the actor and critic entropy is the primary reason behind the poor performance of DSAC. We demonstrate that merely decoupling these components allows DSAC to achieve performance comparable to DQN. Motivated by this insight, we introduce a flexible off-policy actor-critic framework that subsumes DSAC as a special case. Our framework allows using an m-step Bellman operator for the critic update, and enables combining standard policy optimization methods with entropy regularization to instantiate the resulting actor objective. Theoretically, we prove that the proposed methods guarantee convergence to the optimal regularized value function in the tabular setting. Empirically, we demonstrate that these methods can approach the performance of DQN on standard Atari games, and do so even without entropy regularization or explicit exploration.
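To make the idea of decoupled entropy concrete, the following is a minimal tabular sketch in plain Python. It is not the paper's algorithm: the function names, the separate coefficients `tau_critic` and `tau_actor`, and the choice to apply the entropy bonus only at the bootstrap state of the m-step target are all assumptions made for illustration. The point it shows is that the critic target and the actor update can each use their own entropy coefficient, whereas a coupled method would use the same value in both places.

```python
import math


def soft_value(q_row, pi_row, tau):
    """Expected Q under pi plus tau times the entropy of pi (one state)."""
    return sum(p * (q - tau * math.log(p)) for q, p in zip(q_row, pi_row) if p > 0.0)


def m_step_target(rewards, next_q, next_pi, gamma, tau_critic):
    """m-step critic target: discounted rewards plus a bootstrapped soft value.

    Simplifying assumption: the entropy bonus is applied only at the
    bootstrap state, not at the intermediate steps.
    """
    m = len(rewards)
    discounted = sum(gamma ** k * r for k, r in enumerate(rewards))
    return discounted + gamma ** m * soft_value(next_q, next_pi, tau_critic)


def actor_target(q_row, tau_actor):
    """Policy the actor moves toward: softmax(Q / tau_actor), or greedy if tau_actor is 0."""
    if tau_actor == 0.0:
        best = max(range(len(q_row)), key=lambda a: q_row[a])
        return [1.0 if a == best else 0.0 for a in range(len(q_row))]
    top = max(q_row)  # subtract the max for numerical stability
    weights = [math.exp((q - top) / tau_actor) for q in q_row]
    total = sum(weights)
    return [w / total for w in weights]


# Decoupled use: no entropy in the critic target, some entropy in the actor.
target = m_step_target([1.0, 1.0], [2.0, 4.0], [0.5, 0.5], gamma=0.5, tau_critic=0.0)
policy = actor_target([2.0, 4.0], tau_actor=1.0)
```

Setting `tau_critic` equal to `tau_actor` corresponds to the coupled case in this sketch, and setting both to zero gives an unregularized critic target with a greedy actor target.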