🤖 AI Summary
Deterministic policy gradient methods for continuous-action reinforcement learning exploit the gradient of the action-value function but commit to a single action per state, while stochastic policy gradient methods retain exploration but typically ignore action-value gradients or rely on the reparameterization trick. Method: We propose Wasserstein Policy Optimization (WPO), which derives a policy update by projecting Wasserstein gradient flow over the space of all policies onto a finite-dimensional parameter space (e.g., the weights of a neural network). The result is a simple, fully general closed-form update that requires no reparameterization and applies to any differentiable action distribution. Within an actor-critic framework, WPO combines the strengths of both families: like deterministic policy gradients, it exploits gradients of the action-value function with respect to the action; like classic policy gradients, it works with stochastic policies. Results: On the DeepMind Control Suite and a magnetic confinement fusion control task, WPO compares favorably with state-of-the-art continuous control methods, including SAC and TD3.
📝 Abstract
We introduce Wasserstein Policy Optimization (WPO), an actor-critic algorithm for reinforcement learning in continuous action spaces. WPO can be derived as an approximation to Wasserstein gradient flow over the space of all policies projected into a finite-dimensional parameter space (e.g., the weights of a neural network), leading to a simple and completely general closed-form update. The resulting algorithm combines many properties of deterministic and classic policy gradient methods. Like deterministic policy gradients, it exploits knowledge of the gradient of the action-value function with respect to the action. Like classic policy gradients, it can be applied to stochastic policies with arbitrary distributions over actions -- without using the reparameterization trick. We show results on the DeepMind Control Suite and a magnetic confinement fusion task that compare favorably with state-of-the-art continuous control methods.
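To make the abstract's key contrast concrete, here is a minimal toy sketch, not the paper's implementation: it assumes a 1-D Gaussian policy with a fixed standard deviation and a known quadratic critic `Q(a) = -(a - a_star)**2`, both chosen purely for illustration. The mean is improved using the gradient of Q with respect to the action, evaluated at sampled actions, with no reparameterization trick. The `1/sigma**2` factor follows from the Gaussian identity d/dmu d/da log pi(a) = 1/sigma**2; how this specializes the general WPO update is my reading of the abstract, not a claim about the paper's exact algorithm.

```python
import random

# Toy sketch (an assumption for illustration, not the paper's method):
# Gaussian policy pi(a) = N(mu, sigma^2) with fixed sigma, improved via
# the action gradient of a known critic Q(a) = -(a - a_star)^2 at
# *sampled* actions -- no reparameterization trick needed.

random.seed(0)
a_star = 1.0           # action maximizing Q
mu, sigma = -1.0, 0.5  # initial policy mean; std held fixed
lr, batch = 0.05, 256

def grad_a_Q(a):
    """Gradient of Q(a) = -(a - a_star)^2 with respect to the action a."""
    return -2.0 * (a - a_star)

for _ in range(500):
    actions = [random.gauss(mu, sigma) for _ in range(batch)]
    # For a Gaussian mean parameter, d/dmu d/da log pi(a) = 1/sigma**2,
    # so the update on mu averages grad_a Q over sampled actions.
    mu += lr * sum(grad_a_Q(a) / sigma**2 for a in actions) / batch
```

In this special case the mean moves along the average action-value gradient under the current policy, which mirrors the abstract's point: the efficiency of deterministic gradients (using dQ/da) while keeping a genuinely stochastic policy for exploration.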