🤖 AI Summary
This work addresses the challenge of estimating epistemic uncertainty in value functions in reinforcement learning. The authors propose UVU, a single-model approach that quantifies policy-dependent value uncertainty as the squared prediction error between an online network and a fixed, randomly initialized target network. Unlike random network distillation, the online network is trained with temporal-difference learning on a synthetic reward derived from the fixed target network, so its prediction errors reflect the future uncertainties a given policy may encounter. Theoretically, using neural tangent kernel (NTK) analysis, the authors prove that in the infinite-width limit UVU's uncertainty estimate is exactly equivalent to the variance of an ensemble of independent universal value functions. Empirically, on challenging multi-task offline RL benchmarks, UVU matches the performance of large ensembles while incurring substantially lower computational overhead, offering a favorable trade-off among simplicity, efficiency, and theoretical grounding.
📝 Abstract
Estimating epistemic uncertainty in value functions is a crucial challenge for many aspects of reinforcement learning (RL), including efficient exploration, safe decision-making, and offline RL. While deep ensembles provide a robust method for quantifying value uncertainty, they come with significant computational overhead. Single-model methods, while computationally favorable, often rely on heuristics and typically require additional mechanisms to propagate otherwise myopic uncertainty estimates. In this work, we introduce universal value-function uncertainties (UVU), which, similar in spirit to random network distillation (RND), quantify uncertainty as the squared prediction error between an online learner and a fixed, randomly initialized target network. Unlike RND, UVU errors reflect policy-conditional value uncertainty, incorporating the future uncertainties any given policy may encounter. This is due to the training procedure employed in UVU: the online network is trained using temporal difference learning with a synthetic reward derived from the fixed, randomly initialized target network. We provide an extensive theoretical analysis of our approach using neural tangent kernel (NTK) theory and show that in the limit of infinite network width, UVU errors are exactly equivalent to the variance of an ensemble of independent universal value functions. Empirically, we show that UVU matches the performance of large ensembles on challenging multi-task offline RL settings, while offering simplicity and substantial computational savings.
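The mechanism described in the abstract can be sketched in a tiny tabular setting: a fixed random "target" plays the role of the randomly initialized target network, an online estimate is trained by temporal-difference learning on a synthetic reward derived from that target, and uncertainty is read off as the squared prediction error between the two. The specific synthetic-reward form used here (the target's own Bellman residual, chosen so the online estimate converges to the target exactly on visited states) is an illustrative assumption, not necessarily the paper's exact construction.

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, gamma, lr = 6, 0.9, 0.5

# Fixed, randomly initialized target values (tabular stand-in for the
# fixed random target network in UVU/RND).
g = rng.normal(size=n_states)

# Online estimate, trained with TD learning on a synthetic reward.
f = np.zeros(n_states)

# The policy's data covers only a cycle over states 0..3;
# states 4 and 5 are never visited.
transitions = [(0, 1), (1, 2), (2, 3), (3, 0)]

for _ in range(2000):
    for s, s_next in transitions:
        # Synthetic reward derived from the fixed target (assumed form:
        # its Bellman residual, so that f -> g on visited data).
        r_synth = g[s] - gamma * g[s_next]
        td_target = r_synth + gamma * f[s_next]
        f[s] += lr * (td_target - f[s])

# UVU-style uncertainty: squared prediction error between online and target.
# It vanishes on visited states and stays large on the unvisited ones,
# because TD propagates the target's values only along observed transitions.
uncertainty = (f - g) ** 2
```

Because the TD update is run to convergence on the visited cycle, the online values there match the target to numerical precision, while the never-visited states retain their initialization error; the policy-dependence of the estimate comes from which transitions the TD updates are run on.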