AI Summary
This work addresses the challenges of low sample efficiency and training instability in reinforcement learning under stochastic or noisy environments by proposing Distributional Sobolev Training. It introduces a novel approach that jointly models the state-action value function and its gradient as a distribution, constructing a distributional Bellman operator grounded in a first-order world model. The paper establishes the existence of a unique fixed point for this operator, revealing a smoothness trade-off inherent in gradient-aware reinforcement learning. The method employs a conditional variational autoencoder (cVAE) to model the environment's transition and reward distributions, combined with Max-sliced Maximum Mean Discrepancy (MSMMD) for distributional Bellman updates. Empirical evaluations on stochastic toy tasks and multiple MuJoCo benchmarks demonstrate significant improvements over existing methods such as MAGE, confirming the approach's effectiveness and robustness.
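The distributional Bellman update described above can be made concrete with a sample-based sketch: draw rewards and next states from a stochastic one-step world model, then form target particles `r + γ Z(s', a')` for the value distribution. The names `world_model` and `value_particles` below are hypothetical placeholders, not the paper's actual interfaces; this is a minimal illustration of the update's structure, assuming a particle representation of the return distribution.

```python
import numpy as np

def sample_bellman_targets(world_model, value_particles, s, a, gamma=0.99, n=32):
    """Sample n target particles for the distributional Bellman update
    Z(s, a) =_D r + gamma * Z(s', a').

    world_model(s, a, n)  -> (rewards, next_states): n samples from the
                             stochastic one-step model (hypothetical API).
    value_particles(s')   -> one sampled return per next state
                             (hypothetical API).
    """
    rewards, next_states = world_model(s, a, n)   # n draws of (r, s')
    z_next = value_particles(next_states)          # one value sample per s'
    return rewards + gamma * z_next                # target particle set
```

A distributional loss such as MSMMD would then compare these target particles against samples from the current value distribution at `(s, a)`.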
Abstract
Gradient-regularized value learning methods improve sample efficiency by leveraging learned models of transition dynamics and rewards to estimate return gradients. However, existing approaches, such as MAGE, struggle in stochastic or noisy environments, limiting their applicability. In this work, we address these limitations by extending distributional reinforcement learning over continuous state-action spaces to model not only the distribution over scalar state-action values but also over their gradients. We refer to this approach as Distributional Sobolev Training. Inspired by Stochastic Value Gradients (SVG), our method utilizes a one-step world model of the reward and transition distributions implemented via a conditional Variational Autoencoder (cVAE). The proposed framework is sample-based and employs Max-sliced Maximum Mean Discrepancy (MSMMD) to instantiate the distributional Bellman operator. We prove that the Sobolev-augmented Bellman operator is a contraction with a unique fixed point, and highlight a fundamental smoothness trade-off underlying contraction in gradient-aware RL. To validate our method, we first showcase its effectiveness on a simple stochastic reinforcement learning toy problem, then benchmark its performance on several MuJoCo environments.