Revisiting Actor-Critic Methods in Discrete Action Off-Policy Reinforcement Learning

📅 2025-09-11

📈 Citations: 0

✨ Influential: 0


🤖 AI Summary

In discrete-action off-policy reinforcement learning, actor-critic methods (e.g., SAC) underperform value-based approaches (e.g., DQN). Method: This paper identifies the coupling between the actor and critic entropy terms as the primary cause and proposes decoupling them, so that the actor and the critic are regularized separately. Building on this, it introduces a general off-policy actor-critic framework that supports *m*-step Bellman updates for the critic and flexible policy optimization schemes for the actor, with proven convergence to the optimal regularized value function in the tabular setting. The framework subsumes discrete SAC as a special case and combines standard policy optimization methods with entropy regularization. Contribution/Results: On standard Atari games, the resulting methods approach the performance of DQN, even without entropy regularization or explicit exploration.


📝 Abstract

Value-based approaches such as DQN are the default methods for off-policy reinforcement learning in discrete-action environments such as Atari. Common policy-based methods are either on-policy and do not learn effectively from off-policy data (e.g., PPO), or perform poorly in the discrete-action setting (e.g., SAC). Consequently, starting from discrete SAC (DSAC), we revisit the design of actor-critic methods in this setting. First, we determine that the coupling between the actor and critic entropy is the primary reason behind the poor performance of DSAC. We demonstrate that by merely decoupling these components, DSAC can achieve performance comparable to DQN. Motivated by this insight, we introduce a flexible off-policy actor-critic framework that subsumes DSAC as a special case. Our framework allows using an m-step Bellman operator for the critic update, and enables combining standard policy optimization methods with entropy regularization to instantiate the resulting actor objective. Theoretically, we prove that the proposed methods converge to the optimal regularized value function in the tabular setting. Empirically, we demonstrate that these methods can approach the performance of DQN on standard Atari games, and do so even without entropy regularization or explicit exploration.
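The decoupling described in the abstract can be illustrated with a minimal NumPy sketch. It assumes the usual discrete SAC form of the soft critic target and the actor loss; the function names and the two coefficients `alpha_critic` and `alpha_actor` are hypothetical and are not taken from the paper. Coupled DSAC would correspond to using the same value for both coefficients.

```python
import numpy as np


def critic_target(reward, done, next_q, next_probs, gamma=0.99, alpha_critic=0.0):
    """One-step soft target for a discrete-action critic (illustrative).

    reward, done: arrays of shape (batch,).
    next_q, next_probs: arrays of shape (batch, num_actions).
    alpha_critic weights the entropy bonus inside the critic target only.
    """
    next_log_probs = np.log(np.clip(next_probs, 1e-8, 1.0))
    # Expectation over next actions of Q minus the weighted log-probability.
    soft_value = np.sum(next_probs * (next_q - alpha_critic * next_log_probs), axis=1)
    return reward + gamma * (1.0 - done) * soft_value


def actor_loss(q, probs, alpha_actor=0.01):
    """Entropy-regularized actor loss with its own coefficient (illustrative)."""
    log_probs = np.log(np.clip(probs, 1e-8, 1.0))
    # Minimizing this maximizes expected Q plus alpha_actor times the entropy.
    return np.mean(np.sum(probs * (alpha_actor * log_probs - q), axis=1))
```

Under these assumptions, setting `alpha_critic=0.0` leaves the critic target as a plain expected-value backup, while the actor can still carry its own entropy term through `alpha_actor`.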

Problem

Research questions and friction points this paper is trying to address.

Improving actor-critic methods for discrete action off-policy reinforcement learning

Addressing the poor performance of discrete SAC by decoupling the actor and critic entropy

Developing flexible framework for off-policy learning in discrete environments

Innovation

Methods, ideas, or system contributions that make the work stand out.

Decoupling the actor and critic entropy terms

Using an m-step Bellman operator for the critic update

Combining standard policy optimization methods with entropy regularization
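One common reading of an m-step critic update is a target that sums m discounted rewards and then bootstraps from a value estimate at the state reached after m steps. The sketch below assumes that reading, with no off-policy correction; `m_step_target` and its arguments are illustrative names, and the paper's exact operator may differ.

```python
def m_step_target(rewards, dones, bootstrap_value, gamma=0.99):
    """Uncorrected m-step return target (illustrative).

    rewards, dones: sequences of length m along one stored trajectory segment.
    bootstrap_value: value estimate at the state reached after m steps.
    """
    target = 0.0
    discount = 1.0
    for reward, done in zip(rewards, dones):
        target += discount * reward
        discount *= gamma
        if done:
            # The episode ended inside the window, so there is nothing to bootstrap from.
            return target
    return target + discount * bootstrap_value
```

With a single reward (m = 1) this reduces to the familiar one-step target, reward plus the discounted bootstrap value.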

🔎 Similar Papers

Efficient Off-Policy Learning for High-Dimensional Action Spaces