AI Summary
This work addresses the limitations of conventional deep reinforcement learning approaches that rely on multivariate Gaussian policies, which struggle to model the multimodal action distributions commonly encountered in robotic tasks. To overcome this, the paper presents the first stable integration of normalizing flows into online robotic policy learning, introducing a simple yet effective mechanism to stabilize training, a longstanding challenge for normalizing flows in online reinforcement learning settings. By doing so, the method transcends the representational constraints of Gaussian policies and enables efficient learning of complex, multimodal behaviors. Experimental results demonstrate that the proposed approach achieves robust and superior performance across multiple simulated environments and successfully transfers to real-world robotic systems.
Abstract
Deep Reinforcement Learning (DRL) has advanced significantly in recent years and has been widely applied across many fields. In DRL-based robotic policy learning, however, the current de facto policy parameterization is still the multivariate Gaussian (with diagonal covariance matrix), which cannot model multimodal distributions. In this work, we explore adopting a modern network architecture, the Normalizing Flow (NF), as the policy parameterization for its ability to model multimodal distributions, its closed-form log probability, and its low computation and memory overhead. However, naively training an NF in online Reinforcement Learning (RL) usually leads to training instability. We provide a detailed analysis of this phenomenon and successfully address it with a simple but effective technique. Through extensive experiments in multiple simulation environments, we show that our method, NFPO, obtains robust and strong performance on widely used robotic learning tasks and successfully transfers to real-world robots.
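To illustrate the closed-form log probability the abstract refers to, here is a minimal sketch, not the paper's NFPO implementation, of a single affine flow layer used as a policy head: base Gaussian noise `z` is pushed through an invertible transform `a = z * exp(s) + t`, and the exact action log-density follows from the change-of-variables formula. The class name and the fixed parameters `s`, `t` are illustrative assumptions; in a real NF policy, `s` and `t` would be state-conditioned networks and several coupling layers would be stacked to obtain multimodality.

```python
import numpy as np

class AffineFlowPolicy:
    """Illustrative one-layer normalizing-flow policy (not NFPO itself).

    Forward transform: a = z * exp(s) + t, with z ~ N(0, I).
    Change of variables gives the exact log-density:
        log pi(a) = log N(z; 0, I) - sum(s)
    where sum(s) is the log-determinant of the forward Jacobian.
    """

    def __init__(self, s, t):
        self.s = np.asarray(s, dtype=float)  # per-dimension log-scale
        self.t = np.asarray(t, dtype=float)  # per-dimension shift

    def sample(self, rng):
        z = rng.standard_normal(self.s.shape)
        return z * np.exp(self.s) + self.t   # forward (sampling) pass

    def log_prob(self, a):
        z = (a - self.t) * np.exp(-self.s)   # inverse pass recovers z
        base = -0.5 * np.sum(z**2 + np.log(2.0 * np.pi))
        return base - np.sum(self.s)         # subtract log|det Jacobian|
```

With a single affine layer this density coincides with a diagonal Gaussian (mean `t`, stddev `exp(s)`); the point of the construction is that stacking nonlinear, state-conditioned coupling layers preserves the exact `log_prob` while making the action distribution multimodal.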