🤖 AI Summary
Existing Generalized Advantage Estimation (GAE) struggles to model the value distribution characteristics inherent in distributional reinforcement learning, leading to unstable advantage estimation under stochasticity and noise. To address this, we propose Distributional Generalized Advantage Estimation (DGAE), the first GAE variant incorporating optimal transport theory. DGAE introduces a distributional similarity metric grounded in the Wasserstein distance and endowed with direction-awareness, enabling robust comparison of value distributions. Coupled with exponentially weighted policy gradient estimation, DGAE achieves low-variance, bias-controllable advantage function approximation. DGAE integrates into mainstream policy gradient algorithms (e.g., PPO and A2C) without architectural modification. Empirical evaluation across multiple OpenAI Gym benchmarks demonstrates significant improvements in both sample efficiency and final policy performance. This work establishes the first optimal-transport-based advantage estimation paradigm for distributional reinforcement learning.
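The paper's direction-aware Wasserstein metric is not specified at this level of detail. As a rough illustrative sketch only: in one dimension, the Wasserstein-1 distance between two empirical distributions with equal sample counts reduces to the mean absolute difference of sorted samples, and one hypothetical way to make such a distance "directional" is to sign it by the mean shift. The sign convention below is an assumption for illustration, not the paper's definition.

```python
import numpy as np

def signed_w1(samples_p, samples_q):
    """Illustrative signed 1D Wasserstein-1 distance (NOT the paper's metric).

    Assumes both inputs are 1D arrays with the same number of samples,
    so W1 reduces to the mean absolute difference of sorted samples.
    """
    p, q = np.sort(samples_p), np.sort(samples_q)
    dist = np.mean(np.abs(p - q))            # empirical W1 via quantile coupling
    # Hypothetical direction: sign of the mean shift between distributions.
    return np.sign(np.mean(p) - np.mean(q)) * dist
```

A signed value of this kind lets a comparison of value distributions say not only how far apart they are but which one stochastically dominates on average, which is the role a directional metric plays in advantage estimation.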
📝 Abstract
Generalized Advantage Estimation (GAE) mitigates the high variance of policy gradient estimates in reinforcement learning (RL) by employing an exponentially weighted estimate of the advantage function. Despite its effectiveness, GAE is not designed to handle the value distributions integral to distributional RL, which capture the inherent stochasticity in systems and are hence more robust to system noise. To address this gap, we propose a novel approach that utilizes optimal transport theory to introduce a Wasserstein-like directional metric, which measures both the distance and the directional discrepancy between probability distributions. Using exponentially weighted estimation, we leverage this Wasserstein-like directional metric to derive distributional GAE (DGAE). Like traditional GAE, our proposed DGAE provides a low-variance advantage estimate with controlled bias, making it well suited for policy gradient algorithms that rely on advantage estimation for policy updates. We integrated DGAE into three different policy gradient methods, evaluated the algorithms across various OpenAI Gym environments, and compared them against baselines using traditional GAE.
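For context, the exponentially weighted estimate that standard GAE computes is the discounted sum of TD residuals, A_t = Σ_l (γλ)^l δ_{t+l} with δ_t = r_t + γV(s_{t+1}) − V(s_t). A minimal sketch of this standard formulation (not the distributional DGAE variant proposed here):

```python
import numpy as np

def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
    """Standard GAE over one trajectory.

    rewards: array of r_0..r_{T-1}; values: array of V(s_0)..V(s_T)
    (length T+1, with the bootstrap value last).
    """
    T = len(rewards)
    # TD residuals: delta_t = r_t + gamma * V(s_{t+1}) - V(s_t)
    deltas = rewards + gamma * values[1:] - values[:-1]
    advantages = np.zeros(T)
    gae = 0.0
    # Backward recursion: A_t = delta_t + gamma * lambda * A_{t+1}
    for t in reversed(range(T)):
        gae = deltas[t] + gamma * lam * gae
        advantages[t] = gae
    return advantages
```

The λ parameter trades bias against variance (λ=0 gives the one-step TD residual, λ=1 the Monte Carlo return minus the baseline); DGAE replaces the scalar residuals in this recursion with a distributional discrepancy while keeping the same exponential weighting.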