🤖 AI Summary
Deep neural networks in reinforcement learning often produce poorly calibrated uncertainty estimates, hindering safe exploration and reliable decision-making. To address this, we propose Deep Gaussian Process Proximal Policy Optimization (GPPO), a scalable, model-free actor-critic algorithm that integrates Deep Gaussian Processes (DGPs) with Proximal Policy Optimization (PPO), the first such approach to achieve both high-performance policy learning and well-calibrated uncertainty quantification in high-dimensional continuous control tasks. The method trains the DGPs with stochastic variational inference and models uncertainty in both the policy (actor) and the value function (critic). Experiments on standard benchmarks show that GPPO matches PPO's asymptotic performance while delivering markedly better uncertainty calibration, evidenced by lower expected calibration error and sharper, more reliable confidence intervals. This improved calibration directly supports safer exploration and more trustworthy decisions in uncertain environments.
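To make the calibration claim concrete, the sketch below computes one common expected-calibration-error variant for regression: the mean absolute gap between the nominal and empirical coverage of central Gaussian credible intervals. This is a generic metric for illustration; the paper's exact ECE definition may differ, and the synthetic data here is our own.

```python
import numpy as np
from statistics import NormalDist

def regression_calibration_error(y_true, mu, sigma,
                                 levels=(0.1, 0.3, 0.5, 0.7, 0.9)):
    """Mean absolute gap between nominal coverage p and the empirical
    fraction of targets inside the central p-credible interval of a
    Gaussian predictive distribution (illustrative ECE variant)."""
    gaps = []
    for p in levels:
        # Half-width of the central p-interval, in predicted std units.
        z = NormalDist().inv_cdf(0.5 + p / 2.0)
        inside = np.abs(y_true - mu) <= z * sigma
        gaps.append(abs(inside.mean() - p))
    return float(np.mean(gaps))

# Synthetic check: predictions whose sigma matches the true noise scale
# are well calibrated; halving sigma makes them overconfident.
rng = np.random.default_rng(0)
mu = rng.normal(size=10_000)
y = mu + rng.normal(size=10_000)
well = regression_calibration_error(y, mu, np.ones_like(mu))
over = regression_calibration_error(y, mu, 0.5 * np.ones_like(mu))
```

Under this metric, the overconfident predictor's intervals cover far fewer targets than their nominal level promises, which is exactly the failure mode the summary attributes to standard deep RL value estimates.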
📝 Abstract
Uncertainty estimation is a critical component of Reinforcement Learning (RL) in control tasks where agents must balance safe exploration against efficient learning. While deep neural networks have enabled breakthroughs in RL, they often lack calibrated uncertainty estimates. We introduce Deep Gaussian Process Proximal Policy Optimization (GPPO), a scalable, model-free actor-critic algorithm that leverages Deep Gaussian Processes (DGPs) to approximate both the policy and the value function. GPPO remains competitive with Proximal Policy Optimization (PPO) on standard high-dimensional continuous control benchmarks while providing well-calibrated uncertainty estimates that can inform safer and more effective exploration.
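As a minimal illustration of how a calibrated critic could inform safer exploration, the rule below gates an action on the lower confidence bound of its predicted value. The gating rule and the `safe_to_act` helper are our own illustrative assumptions, not part of GPPO as described here.

```python
def safe_to_act(value_mean, value_std, threshold, z=1.645):
    """Illustrative safety gate: act only if the ~95% lower confidence
    bound on the critic's predicted value clears a safety threshold.
    With a calibrated critic, this bound is trustworthy; an
    overconfident critic (too-small value_std) would pass unsafe
    actions through."""
    return value_mean - z * value_std > threshold

# Same predicted mean, different uncertainty: only the confident
# estimate clears the gate.
confident = safe_to_act(1.0, 0.1, threshold=0.5)   # bound ~0.84
uncertain = safe_to_act(1.0, 0.5, threshold=0.5)   # bound ~0.18
```

The point of the example is that such a rule is only as good as the critic's calibration, which is why the abstract emphasizes calibrated uncertainty rather than raw performance alone.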