Fast Convergence of Softmax Policy Mirror Ascent

📅 2024-11-18
🏛️ arXiv.org
📈 Citations: 1
Influential: 0
🤖 AI Summary
To address the slow convergence, reliance on compatible function approximation, and non-convex optimization bottlenecks of Natural Policy Gradient (NPG) in large-scale MDPs, this paper proposes Softmax Policy Mirror Ascent (SPMA), a policy optimization framework that performs mirror ascent in the dual space of logits. SPMA eliminates the need for normalization across actions and treats the tabular and function approximation settings uniformly. With plain linear function approximation, it achieves linear convergence to a neighbourhood of the optimal value function, without requiring compatibility assumptions or solving non-convex subproblems; each iteration reduces to solving a convex softmax classification problem. Theoretically, SPMA matches NPG's linear convergence rate in tabular MDPs. Empirically, it attains performance competitive with or superior to MDPO, PPO, and TRPO on the MuJoCo and Atari benchmarks.
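The summary's key idea, mirror ascent in the dual space of logits without normalization across actions, can be sketched in the tabular case. This is an illustrative toy (the `spma_style_logit_step` name, shapes, and step size are assumptions for exposition, not the paper's exact update): shifting each state's logits along an advantage estimate updates the softmax policy directly, and because the softmax is invariant to per-state constants, no explicit normalization step is needed.

```python
import numpy as np

def softmax(z):
    """Row-wise softmax over the action axis."""
    z = z - z.max(axis=-1, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def spma_style_logit_step(logits, advantages, step_size):
    """Illustrative mirror-ascent step in logit (dual) space: shift each
    state's logits along the advantage direction. The softmax is invariant
    to adding a per-state constant, so no normalization across actions is
    required after the update."""
    return logits + step_size * advantages

# Toy example: 2 states, 3 actions, advantages centered per state.
rng = np.random.default_rng(0)
logits = np.zeros((2, 3))
advantages = rng.normal(size=(2, 3))
advantages -= advantages.mean(axis=-1, keepdims=True)
new_policy = softmax(spma_style_logit_step(logits, advantages, step_size=0.5))
```

Each row of `new_policy` is a valid distribution, with probability mass shifted toward higher-advantage actions.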

📝 Abstract
Natural policy gradient (NPG) is a common policy optimization algorithm and can be viewed as mirror ascent in the space of probabilities. Recently, Vaswani et al. [2021] introduced a policy gradient method that corresponds to mirror ascent in the dual space of logits. We refine this algorithm, removing its need for a normalization across actions, and analyze the resulting method (referred to as SPMA). For tabular MDPs, we prove that SPMA with a constant step-size matches the linear convergence of NPG and achieves faster convergence than constant step-size (accelerated) softmax policy gradient. To handle large state-action spaces, we extend SPMA to use a log-linear policy parameterization. Unlike NPG, generalizing SPMA to the linear function approximation (FA) setting does not require compatible function approximation. Unlike MDPO, a practical generalization of NPG, SPMA with linear FA only requires solving convex softmax classification problems. We prove that SPMA achieves linear convergence to the neighbourhood of the optimal value function. We extend SPMA to handle non-linear FA and evaluate its empirical performance on the MuJoCo and Atari benchmarks. Our results demonstrate that SPMA consistently achieves similar or better performance compared to MDPO, PPO and TRPO.
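The abstract's claim that SPMA with linear FA "only requires solving convex softmax classification problems" can be illustrated with a minimal sketch. The setup below is hypothetical (the feature shapes, target construction, learning rate, and `fit_softmax_classification` helper are assumptions, not the paper's algorithm): with a log-linear policy, each iteration fits the parameterized softmax to a target distribution via cross-entropy, which is convex in the parameters θ.

```python
import numpy as np

def softmax(z):
    """Row-wise softmax over the action axis."""
    z = z - z.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def fit_softmax_classification(features, targets, lr=0.1, iters=500):
    """Fit log-linear logits z(s, a) = <features[s, a], theta> to the
    target action distributions by minimizing mean cross-entropy,
    a convex problem in theta. Plain gradient descent for illustration."""
    n_states, n_actions, d = features.shape
    theta = np.zeros(d)
    for _ in range(iters):
        probs = softmax(features @ theta)  # (S, A)
        # gradient of mean cross-entropy w.r.t. theta
        grad = np.einsum("sa,sad->d", probs - targets, features) / n_states
        theta -= lr * grad
    return theta

# Toy instance: 4 states, 3 actions, 5-dimensional features.
rng = np.random.default_rng(1)
S, A, d = 4, 3, 5
features = rng.normal(size=(S, A, d))
old_logits = features @ rng.normal(size=d)
advantages = rng.normal(size=(S, A))
eta = 0.5
# SPMA-style target: current logits shifted by scaled advantages.
targets = softmax(old_logits + eta * advantages)
theta = fit_softmax_classification(features, targets)
```

The inner problem is an ordinary softmax regression, so standard convex solvers apply; no non-convex subproblem (as in MDPO's surrogate) arises in this sketch.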
Problem

Research questions and friction points this paper is trying to address.

Refining mirror ascent policy optimization for faster convergence
Extending SPMA to large state-action spaces efficiently
Achieving linear convergence without compatible function approximation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Refines mirror ascent in dual logit space
Removes action normalization requirement
Extends to log-linear policy parameterization