Moments Matter: Stabilizing Policy Optimization using Return Distributions

📅 2026-01-05
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the significant behavioral variability of deep reinforcement learning policies that achieve identical expected returns, a phenomenon exacerbated by environmental and algorithmic noise that leads to unstable gaits and hinders both algorithmic comparison and real-world transfer. To mitigate this issue, the authors propose an approach that avoids explicit estimation of the post-update return distribution. Instead, they employ a distributional critic to model the state-action return distribution and leverage its higher-order moments, specifically skewness and kurtosis, to refine the advantage function in Proximal Policy Optimization (PPO). This modification steers the policy away from unstable parameter regions. Evaluated on benchmarks such as Walker2D, the method improves policy stability by up to 75% while maintaining comparable evaluation returns, substantially enhancing robustness and transferability.

📝 Abstract
Deep Reinforcement Learning (RL) agents often learn policies that achieve the same episodic return yet behave very differently, due to a combination of environmental (random transitions, initial conditions, reward noise) and algorithmic (minibatch selection, exploration noise) factors. In continuous control tasks, even small parameter shifts can produce unstable gaits, complicating both algorithm comparison and real-world transfer. Previous work has shown that such instability arises when policy updates traverse noisy neighborhoods, and that the spread of the post-update return distribution $R(\theta)$, obtained by repeatedly sampling minibatches, updating $\theta$, and measuring final returns, is a useful indicator of this noise. Although explicitly constraining the policy to maintain a narrow $R(\theta)$ can improve stability, directly estimating $R(\theta)$ is computationally expensive in high-dimensional settings. We propose an alternative that takes advantage of environmental stochasticity to mitigate update-induced variability. Specifically, we model the state-action return distribution through a distributional critic and then bias the advantage function of PPO using higher-order moments (skewness and kurtosis) of this distribution. By penalizing extreme tail behaviors, our method discourages policies from entering parameter regimes prone to instability. We hypothesize that in environments where post-update critic values align poorly with post-update returns, standard PPO struggles to produce a narrow $R(\theta)$. In such cases, our moment-based correction narrows $R(\theta)$, improving stability by up to 75% in Walker2D, while preserving comparable evaluation returns.
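The moment-based advantage correction described in the abstract could be sketched as follows. This is a minimal illustration, not the paper's exact formulation: the penalty coefficients `lam_skew` and `lam_kurt`, and the choice to subtract absolute skewness and absolute excess kurtosis from the advantage, are assumptions made here for concreteness. The return samples would, in the paper's setting, come from a distributional critic (e.g. quantile estimates per state-action pair).

```python
import numpy as np

def moment_adjusted_advantage(adv, return_samples, lam_skew=0.1, lam_kurt=0.1):
    """Bias PPO advantages using higher-order moments of a critic's
    return distribution (illustrative sketch).

    adv            -- raw advantage estimates, shape (batch,)
    return_samples -- per-state-action return samples from a
                      distributional critic, shape (batch, n_atoms)
    """
    mean = return_samples.mean(axis=-1, keepdims=True)
    std = return_samples.std(axis=-1, keepdims=True) + 1e-8
    z = (return_samples - mean) / std
    # Standardized third and fourth moments: skewness and excess kurtosis.
    skew = (z ** 3).mean(axis=-1)
    excess_kurt = (z ** 4).mean(axis=-1) - 3.0
    # Penalize asymmetry and heavy tails to discourage the policy from
    # parameter regions with extreme tail behavior.
    return adv - lam_skew * np.abs(skew) - lam_kurt * np.abs(excess_kurt)
```

For example, a batch element whose critic distribution has rare far outliers (high kurtosis) receives a larger downward bias than one with a near-uniform spread, so the policy gradient is steered away from the heavy-tailed case even when both have the same raw advantage.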
Problem

Research questions and friction points this paper is trying to address.

policy optimization
return distribution
stability
deep reinforcement learning
continuous control
Innovation

Methods, ideas, or system contributions that make the work stand out.

distributional reinforcement learning
policy optimization stability
higher-order moments
PPO
return distribution