Bidirectional Soft Actor-Critic: Leveraging Forward and Reverse KL Divergence for Efficient Reinforcement Learning

📅 2025-06-02

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Standard Soft Actor-Critic (SAC) updates its policy by minimizing reverse KL divergence, which makes the optimal policy projection analytically intractable and forces gradient-based approximations that can cause training instability and poor sample efficiency. This work investigates forward KL divergence as an alternative and shows that, for Gaussian policies, it yields a closed-form optimal policy projection within SAC. It further introduces a bidirectional optimization scheme that initializes the policy with the forward KL projection and then fine-tunes it with reverse KL. On the theory side, the approach draws on KL analysis under Gaussian policies, the action marginals of the Boltzmann target distribution, and policy projection theory; on the method side, it aims to combine stability with optimality. On standard continuous-control benchmarks, the proposed method achieves up to a 30% increase in episodic rewards, along with improved sample efficiency and training robustness.
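
As a rough sketch of the two projections being contrasted, written in notation assumed here rather than taken from the paper ($p$ for the Boltzmann target, $\pi_\theta$ for a diagonal Gaussian policy with per-dimension mean $\mu_i$ and variance $\sigma_i^2$, and $\alpha$ for the entropy temperature):

$$
\begin{aligned}
p(a \mid s) &\propto \exp\!\big(Q(s,a)/\alpha\big) \\
\text{reverse KL:}\quad & \min_\theta \; D_{\mathrm{KL}}\!\big(\pi_\theta(\cdot \mid s) \,\|\, p(\cdot \mid s)\big) \\
\text{forward KL:}\quad & \min_\theta \; D_{\mathrm{KL}}\!\big(p(\cdot \mid s) \,\|\, \pi_\theta(\cdot \mid s)\big)
\;\Longrightarrow\;
\mu_i^\ast = \mathbb{E}_{p}[a_i], \quad
(\sigma_i^\ast)^2 = \mathbb{E}_{p}\!\big[(a_i - \mu_i^\ast)^2\big]
\end{aligned}
$$

The forward direction reduces to moment matching because the Gaussian log-density is quadratic in $a$, so only the first two moments of each action marginal of $p$ enter the objective.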

📝 Abstract

The Soft Actor-Critic (SAC) algorithm, a state-of-the-art method in maximum entropy reinforcement learning, traditionally relies on minimizing reverse Kullback-Leibler (KL) divergence for policy updates. However, this approach leads to an intractable optimal projection policy, necessitating gradient-based approximations that can suffer from instability and poor sample efficiency. This paper investigates the alternative use of forward KL divergence within SAC. We demonstrate that for Gaussian policies, forward KL divergence yields an explicit optimal projection policy -- corresponding to the mean and variance of the target Boltzmann distribution's action marginals. Building on the distinct advantages of both KL directions, we propose Bidirectional SAC, an algorithm that first initializes the policy using the explicit forward KL projection and then refines it by optimizing the reverse KL divergence. Comprehensive experiments on continuous control benchmarks show that Bidirectional SAC significantly outperforms standard SAC and other baselines, achieving up to a $30\%$ increase in episodic rewards, alongside enhanced sample efficiency.
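
The explicit forward KL projection described in the abstract can be illustrated with a minimal NumPy sketch. It assumes candidate actions are drawn uniformly over the action range at one fixed state, so that self-normalised weights proportional to exp(Q/alpha) estimate the mean and variance of the Boltzmann target's action marginals. The function name, its arguments, and the uniform-proposal estimator are illustrative assumptions; the abstract does not say how the paper estimates these moments.

```python
import numpy as np

def forward_kl_gaussian_projection(actions, q_values, alpha):
    """Moment-match a diagonal Gaussian to the Boltzmann target exp(Q / alpha).

    actions:  (N, D) candidate actions, assumed uniform over the action range
    q_values: (N,) soft Q-values of those actions at one fixed state
    alpha:    entropy temperature
    (Hypothetical helper for illustration, not the paper's implementation.)
    """
    logits = q_values / alpha
    logits = logits - logits.max()           # subtract the max for numerical stability
    weights = np.exp(logits)
    weights /= weights.sum()                 # self-normalised importance weights
    mean = weights @ actions                 # per-dimension mean, shape (D,)
    var = weights @ (actions - mean) ** 2    # per-dimension variance, shape (D,)
    return mean, var

# Toy check: a quadratic Q makes the Boltzmann target itself Gaussian.
alpha, mu, sigma = 0.2, 0.3, 0.1
grid = np.linspace(-1.0, 1.0, 2001).reshape(-1, 1)
q = -alpha * (grid[:, 0] - mu) ** 2 / (2 * sigma**2)
mean, var = forward_kl_gaussian_projection(grid, q, alpha)
```

In this toy case the target is a Gaussian with mean 0.3 and standard deviation 0.1, so the recovered moments should land close to those values.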

Problem

Research questions and friction points this paper is trying to address.

Investigates forward KL divergence as an alternative to reverse KL for SAC policy updates

Proposes Bidirectional SAC, which combines the advantages of forward and reverse KL

Improves sample efficiency and raises episodic rewards by up to 30%

Innovation

Methods, ideas, or system contributions that make the work stand out.

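Uses forward KL divergence to obtain an explicit optimal projection policy for Gaussian policies

Initializes the policy with the forward KL projection, then refines it by optimizing reverse KL

Improves sample efficiency and episodic rewards over standard SAC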
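
The two-stage idea above (forward KL initialization, then reverse KL refinement) can be sketched on a one-dimensional toy problem. Everything here is an assumption made for illustration: the skewed bimodal Q-function, the temperature, the fixed quadrature nodes, the finite-difference gradients, and the step-halving rule. The actual algorithm works with learned critics and policy networks, which this summary does not detail.

```python
import numpy as np

ALPHA = 0.5  # entropy temperature (illustrative value)

def log_target(a):
    """Unnormalised log Boltzmann density Q(a) / alpha for a toy, skewed bimodal Q."""
    return (-(a**2 - 1.0) ** 2 + 0.5 * a) / ALPHA

def forward_kl_init(grid):
    """Stage 1: moment-match a Gaussian to the target on a uniform grid."""
    logits = log_target(grid)
    w = np.exp(logits - logits.max())
    w /= w.sum()
    mean = w @ grid
    std = np.sqrt(w @ (grid - mean) ** 2)
    return mean, std

# Fixed quadrature nodes for a standard normal keep the objective deterministic.
EPS = np.linspace(-4.0, 4.0, 401)
W = np.exp(-0.5 * EPS**2)
W /= W.sum()

def reverse_kl(mean, log_std):
    """KL(policy || target) up to an additive constant, using a = mean + std * eps."""
    a = mean + np.exp(log_std) * EPS
    return -log_std - W @ log_target(a)

def refine_reverse_kl(mean, log_std, steps=200, lr=0.05, h=1e-5):
    """Stage 2: finite-difference gradient descent; a step is kept only if it lowers the objective."""
    best = reverse_kl(mean, log_std)
    for _ in range(steps):
        g_mean = (reverse_kl(mean + h, log_std) - reverse_kl(mean - h, log_std)) / (2 * h)
        g_std = (reverse_kl(mean, log_std + h) - reverse_kl(mean, log_std - h)) / (2 * h)
        step = lr
        while step > 1e-8:
            new_mean, new_log_std = mean - step * g_mean, log_std - step * g_std
            cand = reverse_kl(new_mean, new_log_std)
            if cand < best:
                mean, log_std, best = new_mean, new_log_std, cand
                break
            step *= 0.5
    return mean, log_std

grid = np.linspace(-3.0, 3.0, 6001)
init_mean, init_std = forward_kl_init(grid)
ref_mean, ref_log_std = refine_reverse_kl(init_mean, np.log(init_std))
```

On this toy target the forward KL stage gives a broad Gaussian that covers both modes, and the reverse KL stage then narrows it toward the dominant mode. This matches the usual mass-covering versus mode-seeking intuition for the two KL directions.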
