🤖 AI Summary
Standard Soft Actor-Critic (SAC) updates its policy by minimizing reverse KL divergence, which makes the optimal projection policy analytically intractable and forces gradient-based approximations that can be unstable and sample-inefficient. This work studies forward KL divergence as an alternative and shows that, for Gaussian policies, it yields an explicit (closed-form) optimal projection: the mean and variance of the target Boltzmann distribution's action marginals. Building on the complementary strengths of the two KL directions, the work introduces Bidirectional SAC, which initializes the policy with the explicit forward KL projection and then refines it by optimizing reverse KL divergence. On standard continuous-control benchmarks, Bidirectional SAC outperforms standard SAC and other baselines, with up to a 30% increase in episodic rewards and improved sample efficiency.
📝 Abstract
The Soft Actor-Critic (SAC) algorithm, a state-of-the-art method in maximum entropy reinforcement learning, traditionally relies on minimizing reverse Kullback-Leibler (KL) divergence for policy updates. However, this approach leads to an intractable optimal projection policy, necessitating gradient-based approximations that can suffer from instability and poor sample efficiency. This paper investigates the alternative use of forward KL divergence within SAC. We demonstrate that for Gaussian policies, forward KL divergence yields an explicit optimal projection policy, corresponding to the mean and variance of the target Boltzmann distribution's action marginals. Building on the distinct advantages of both KL directions, we propose Bidirectional SAC, an algorithm that first initializes the policy using the explicit forward KL projection and then refines it by optimizing the reverse KL divergence. Comprehensive experiments on continuous control benchmarks show that Bidirectional SAC significantly outperforms standard SAC and other baselines, achieving up to a $30\%$ increase in episodic rewards, alongside enhanced sample efficiency.
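The explicit forward KL projection described above can be illustrated with a small moment-matching sketch. It is not the paper's implementation: the toy quadratic critic, the uniformly sampled candidate actions, the temperature `alpha = 0.2`, and all function names are assumptions made for illustration. The sketch weights sampled actions by the Boltzmann factor exp(Q/alpha), then takes the weighted per-dimension mean and variance as the parameters of a diagonal Gaussian policy.

```python
import numpy as np

def boltzmann_weights(q_values, alpha):
    """Normalized weights proportional to exp(Q / alpha), computed stably."""
    logits = q_values / alpha
    logits = logits - logits.max()  # subtract the max to avoid overflow
    w = np.exp(logits)
    return w / w.sum()

def forward_kl_projection(actions, q_values, alpha):
    """Moment-match a diagonal Gaussian to the sampled Boltzmann target.

    actions: (N, D) candidate actions; q_values: (N,) critic values.
    Returns the weighted per-dimension mean and variance.
    """
    w = boltzmann_weights(q_values, alpha)
    mean = w @ actions
    var = w @ (actions - mean) ** 2
    return mean, var

def forward_kl_up_to_const(actions, w, mean, var):
    """Cross-entropy -E_p[log N(a; mean, diag(var))].

    This equals the forward KL divergence plus a term that does not
    depend on (mean, var), so both share the same minimizer.
    """
    log_prob = -0.5 * (np.log(2 * np.pi * var)
                       + (actions - mean) ** 2 / var).sum(axis=1)
    return -(w * log_prob).sum()

# Toy setup (hypothetical): 2-D actions and a concave quadratic critic.
rng = np.random.default_rng(0)
actions = rng.uniform(-1.0, 1.0, size=(2048, 2))
q_values = -((actions[:, 0] - 0.3) ** 2 + 2.0 * (actions[:, 1] + 0.2) ** 2)
alpha = 0.2

mean, var = forward_kl_projection(actions, q_values, alpha)
w = boltzmann_weights(q_values, alpha)
best = forward_kl_up_to_const(actions, w, mean, var)
shifted = forward_kl_up_to_const(actions, w, mean + 0.05, var)
print(best <= shifted)
```

For the weighted sample distribution, the moment-matched parameters minimize the forward KL objective exactly, so any perturbation of the mean or variance cannot lower it. In the bidirectional scheme the abstract describes, such a projection would only serve as the initialization, with the reverse KL refinement applied afterwards.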