🤖 AI Summary
Existing offline reinforcement learning (RL) methods predominantly employ asymmetric f-divergences, such as the KL divergence, for behavioral regularization, since these yield analytic regularized policies and a numerically stable minimization objective; symmetric f-divergences have been largely overlooked because they admit no closed-form regularized policy and can incur numerical issues when used as a loss.
Method: This paper introduces symmetric f-divergences into behavioral regularization for the first time, proposing an analytically tractable policy optimization framework based on a finite Taylor expansion of the f-divergence, which yields explicit closed-form policy updates. By decomposing the symmetric divergence into an asymmetry term and a conditional-symmetry term and Taylor-expanding the latter, the loss is made numerically stable.
Contribution/Results: The resulting algorithm, Symmetric $f$ Actor-Critic (S$f$-AC), performs competitively with mainstream offline RL algorithms on MuJoCo benchmarks and distribution-approximation tasks. It pairs theoretical rigor, via principled symmetric regularization, with empirical robustness, providing a foundation for stable and expressive offline policy learning.
📝 Abstract
This paper introduces symmetric divergences into behavior-regularized policy optimization (BRPO) to establish a novel offline RL framework. Existing methods focus on asymmetric divergences such as KL to obtain analytic regularized policies and a practical minimization objective. We show that symmetric divergences do not permit an analytic policy as the regularizer and can incur numerical issues when used as a loss. We tackle these challenges via the Taylor series of the $f$-divergence. Specifically, we prove that an analytic policy can be obtained with a finite series. For the loss, we observe that symmetric divergences can be decomposed into an asymmetry term and a conditional-symmetry term; Taylor-expanding the latter alleviates the numerical issues. Putting these together, we propose Symmetric $f$ Actor-Critic (S$f$-AC), the first practical BRPO algorithm with symmetric divergences. Experimental results on distribution approximation and MuJoCo verify that S$f$-AC performs competitively.
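To make the Taylor-expansion idea concrete, here is a minimal illustrative sketch (not the paper's algorithm; the function names are ours) using the Jensen-Shannon divergence, a standard symmetric $f$-divergence. For normalized distributions, $\sum_x q(x)\,(p(x)/q(x) - 1) = 0$, so a second-order Taylor expansion of $f$ around $1$ reduces any $f$-divergence to $\tfrac{f''(1)}{2}\,\chi^2(P\,\|\,Q)$; for JS, $f''(1) = \tfrac{1}{4}$. The quadratic surrogate avoids the logarithms of the exact divergence, which is the kind of numerical simplification the paper exploits:

```python
import numpy as np

def js_divergence(p, q):
    """Exact Jensen-Shannon divergence, a symmetric f-divergence:
    JS(P||Q) = 0.5*KL(P||M) + 0.5*KL(Q||M) with M = (P+Q)/2."""
    m = 0.5 * (p + q)
    kl = lambda a, b: np.sum(a * np.log(a / b))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def js_taylor2(p, q):
    """Second-order Taylor approximation around p/q = 1:
    D_f(P||Q) ~ (f''(1)/2) * chi^2(P||Q), and f''(1) = 1/4 for JS,
    giving (1/8) * sum((p - q)^2 / q). No logarithms involved."""
    return 0.125 * np.sum((p - q) ** 2 / q)

# Two nearby discrete distributions: the quadratic surrogate
# closely tracks the exact symmetric divergence.
p = np.array([0.2, 0.3, 0.5])
q = np.array([0.25, 0.25, 0.5])
print(js_divergence(p, q), js_taylor2(p, q))
```

The approximation is accurate when $P$ and $Q$ are close (the regime behavioral regularization enforces), while remaining finite and smooth even where density ratios make the exact log-based loss ill-behaved.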