Evolving Robustness–Exploration Trade-off in Online Reinforcement Learning via Quantile Bayesian Risk MDPs

📅 2026-05-22
🤖 AI Summary
This work addresses epistemic uncertainty in early-stage online reinforcement learning, where scarce data force a trade-off between robustness and exploration. The authors propose a quantile Bayesian risk-aware Markov decision process (BR-MDP), in which the quantile level controls how posterior uncertainty enters the Bellman backup, together with an adaptive quantile schedule that prioritizes robustness initially and gradually promotes exploration as data accumulate. Theoretical analysis establishes asymptotic normality of the difference between the quantile BR-MDP value and the true-environment value, and proves sublinear Bayesian regret bounds relative to both the true optimal value and the optimal BR-MDP robust value. Numerical experiments show strong performance in both exploration-demanding and exploration-costly environments.
📝 Abstract
In online reinforcement learning, data scarcity creates epistemic uncertainty that makes robustness important early in learning, whereas sufficient exploration is needed to learn the true-environment optimal policy. We study this time-varying robustness–exploration trade-off through a quantile Bayesian risk-aware Markov decision process (BR-MDP), in which the quantile level controls how posterior uncertainty enters the Bellman backup. We characterize this control through an asymptotic normality result for the difference between the quantile BR-MDP value and the value in the true environment. The result implies that upper/lower-tail quantiles induce optimism/pessimism towards epistemic uncertainty, and the magnitude of the optimism/pessimism decreases as data accumulate. Building on this characterization, we propose an online Bayesian risk-aware algorithm with an adaptive quantile schedule that emphasizes robustness early and gradually encourages exploration of less-visited state–action pairs. We establish sublinear Bayesian regret bounds with respect to both the true optimal value and the optimal BR-MDP robust value. Numerical experiments demonstrate strong performance in both exploration-demanding and exploration-costly environments.
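To make the role of the quantile level concrete, the following is a minimal sketch of a quantile Bellman backup under posterior uncertainty. It is not the paper's algorithm: it assumes a tabular MDP with known rewards, a Dirichlet posterior over transitions, and a Monte Carlo estimate of the posterior quantile. The names `quantile_backup` and `quantile_schedule`, and the parameters `q_min`, `q_max`, and `scale`, are hypothetical and introduced only for illustration.

```python
import numpy as np


def quantile_backup(alpha, rewards, values, gamma, q, n_samples=200, seed=0):
    """One quantile Bellman backup over a Dirichlet transition posterior.

    alpha:   (S, A, S) Dirichlet parameters over next states (assumed posterior).
    rewards: (S, A) rewards, treated as known in this sketch.
    values:  (S,) current value estimate.
    q:       quantile level in [0, 1]; a scalar or an (S, A) array.
    """
    rng = np.random.default_rng(seed)
    n_states, n_actions, _ = alpha.shape
    q = np.broadcast_to(q, (n_states, n_actions))
    q_values = np.empty((n_states, n_actions))
    for s in range(n_states):
        for a in range(n_actions):
            # Draw plausible transition vectors from the posterior.
            p = rng.dirichlet(alpha[s, a], size=n_samples)  # (n_samples, S)
            # One-step backup target under each sampled model.
            targets = rewards[s, a] + gamma * (p @ values)
            # Low q gives a pessimistic target, high q an optimistic one.
            q_values[s, a] = np.quantile(targets, q[s, a])
    return q_values.max(axis=1), q_values


def quantile_schedule(step, q_min=0.1, q_max=0.9, scale=20.0):
    """Hypothetical schedule: starts at q_min and rises toward q_max with step."""
    return q_min + (q_max - q_min) * (1.0 - np.exp(-step / scale))


if __name__ == "__main__":
    alpha = np.ones((3, 2, 3))  # uniform prior, no data yet
    rewards = np.array([[0.0, 1.0], [0.5, 0.0], [0.0, 0.0]])
    values = np.array([0.0, 1.0, 2.0])
    for step in (0, 20, 200):
        q = quantile_schedule(step)
        v, _ = quantile_backup(alpha, rewards, values, gamma=0.9, q=q)
        print(step, round(float(q), 3), v)
```

In this sketch the schedule depends only on the global step, so a late upper-tail quantile inflates the backup most where the posterior is widest, that is, at less-visited state–action pairs. As the Dirichlet counts grow, the sampled targets concentrate and the gap between lower and upper quantiles shrinks, which mirrors the abstract's statement that the magnitude of optimism or pessimism decreases as data accumulate.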
Problem

Research questions and friction points this paper is trying to address.

robustness
exploration
online reinforcement learning
epistemic uncertainty
Bayesian risk
Innovation

Methods, ideas, or system contributions that make the work stand out.

Quantile Bayesian Risk MDP
Robustness–Exploration Trade-off
Epistemic Uncertainty
Adaptive Quantile Schedule
Bayesian Regret