EUBRL: Epistemic Uncertainty Directed Bayesian Reinforcement Learning

📅 2025-12-17
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the exploration challenge in reinforcement learning under sparse rewards, long horizons, and high stochasticity, this paper proposes an online adaptive exploration mechanism grounded in epistemic uncertainty. Methodologically, it uses epistemic uncertainty as a real-time signal to guide Bayesian policy updates, combining posterior sampling with sufficiently expressive priors within a discounted infinite-horizon MDP framework. Theoretically, it establishes a nearly minimax-optimal regret bound and rigorous sample complexity guarantees. Empirically, the method achieves significant improvements in sample efficiency, scalability, and performance stability across diverse challenging tasks, outperforming state-of-the-art Bayesian and heuristic exploration algorithms.
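The core idea of the summary, using epistemic uncertainty as an adaptive exploration signal, can be illustrated with a toy sketch. This is not the paper's EUBRL algorithm; it is a minimal Gaussian-bandit example in which each arm keeps a conjugate Normal posterior over its mean reward, and the posterior standard deviation (the epistemic uncertainty, which shrinks exactly where knowledge accumulates) serves as an exploration bonus. The function name, constants, and prior choice are all illustrative assumptions.

```python
import numpy as np

def epistemic_bandit(true_means, steps=3000, beta=2.0, seed=0):
    """Epistemic-uncertainty-guided exploration on a Gaussian bandit (toy sketch).

    Each arm has a N(0, 1) prior over its mean reward and unit observation
    noise, so after n pulls with reward sum s the posterior is
    N(s / (n + 1), 1 / (n + 1)). The posterior std is the epistemic
    uncertainty; the agent acts greedily w.r.t. mean + beta * std, so
    exploration is directed at arms the agent knows least about and
    fades automatically as the posterior concentrates.
    """
    rng = np.random.default_rng(seed)
    k = len(true_means)
    counts = np.zeros(k)  # pulls per arm
    sums = np.zeros(k)    # reward sums per arm
    for _ in range(steps):
        post_mean = sums / (counts + 1.0)
        post_std = np.sqrt(1.0 / (counts + 1.0))  # epistemic uncertainty
        a = int(np.argmax(post_mean + beta * post_std))
        r = rng.normal(true_means[a], 1.0)
        counts[a] += 1
        sums[a] += r
    return counts

counts = epistemic_bandit([0.0, 0.2, 1.0])
print(counts)  # pulls concentrate on the best arm (index 2)
```

Because the bonus decays as the posterior narrows, per-step regret from estimation error shrinks over time, which is the intuition (in a vastly simplified setting) behind the adaptive guidance the summary describes.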

📝 Abstract
At the boundary between the known and the unknown, an agent inevitably confronts the dilemma of whether to explore or to exploit. Epistemic uncertainty reflects such boundaries, representing systematic uncertainty due to limited knowledge. In this paper, we propose a Bayesian reinforcement learning (RL) algorithm, $\texttt{EUBRL}$, which leverages epistemic guidance to achieve principled exploration. This guidance adaptively reduces per-step regret arising from estimation errors. We establish nearly minimax-optimal regret and sample complexity guarantees for a class of sufficiently expressive priors in infinite-horizon discounted MDPs. Empirically, we evaluate $\texttt{EUBRL}$ on tasks characterized by sparse rewards, long horizons, and stochasticity. Results demonstrate that $\texttt{EUBRL}$ achieves superior sample efficiency, scalability, and consistency.
Problem

Research questions and friction points this paper is trying to address.

Addresses exploration-exploitation dilemma using epistemic uncertainty
Reduces per-step regret from estimation errors adaptively
Establishes near-minimax-optimal regret and sample complexity bounds in infinite-horizon discounted MDPs
Innovation

Methods, ideas, or system contributions that make the work stand out.

Bayesian RL algorithm using epistemic uncertainty for exploration
Adaptively reduces per-step regret from estimation errors
Provides nearly minimax-optimal regret guarantees for sufficiently expressive priors in infinite-horizon discounted MDPs