🤖 AI Summary
This work addresses epistemic uncertainty in early-stage online reinforcement learning, where scarce data force a trade-off between robustness and exploration. The authors study a quantile Bayesian risk-aware Markov decision process (BR-MDP), in which the quantile level controls how posterior uncertainty enters the Bellman backup, and propose an online algorithm with an adaptive quantile schedule that prioritizes robustness initially and gradually promotes exploration as data accumulate. Theoretical analysis establishes asymptotic normality of the difference between the quantile BR-MDP value and the value in the true environment, and proves sublinear Bayesian regret bounds with respect to both the true optimal value and the optimal BR-MDP robust value. Numerical experiments show strong performance in both exploration-demanding and exploration-costly environments.
📝 Abstract
In online reinforcement learning, data scarcity creates epistemic uncertainty that makes robustness important early in learning, whereas sufficient exploration is needed to learn the true-environment optimal policy. We study this time-varying robustness–exploration trade-off through a quantile Bayesian risk-aware Markov decision process (BR-MDP), in which the quantile level controls how posterior uncertainty enters the Bellman backup. We characterize this control through an asymptotic normality result for the difference between the quantile BR-MDP value and the value in the true environment. The result implies that upper/lower-tail quantiles induce optimism/pessimism towards epistemic uncertainty, and the magnitude of the optimism/pessimism decreases as data accumulate. Building on this characterization, we propose an online Bayesian risk-aware algorithm with an adaptive quantile schedule that emphasizes robustness early and gradually encourages exploration of less-visited state–action pairs. We establish sublinear Bayesian regret bounds with respect to both the true optimal value and the optimal BR-MDP robust value. Numerical experiments demonstrate strong performance in both exploration-demanding and exploration-costly environments.
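To make the role of the quantile level concrete, the following is a minimal illustrative sketch, not the authors' algorithm. It assumes a tabular MDP with known rewards, a Dirichlet posterior over each transition row, and a Monte Carlo estimate of the posterior quantile of the one-step Bellman target. The names `quantile_backup` and `quantile_schedule`, the linear schedule, and all default values are hypothetical choices made for illustration; the abstract does not specify the schedule's form.

```python
import numpy as np


def quantile_backup(counts, rewards, V, alpha, gamma=0.9,
                    n_samples=500, prior=1.0, seed=0):
    """One quantile-based backup under a Dirichlet posterior (illustrative).

    counts:  (S, A, S) observed transition counts
    rewards: (S, A) rewards, assumed known here
    V:       (S,) current value estimate
    alpha:   quantile level in [0, 1]; low = pessimistic, high = optimistic
    """
    rng = np.random.default_rng(seed)
    S, A, _ = counts.shape
    Q = np.empty((S, A))
    for s in range(S):
        for a in range(A):
            # Sample transition rows from the posterior Dirichlet(prior + counts).
            P = rng.dirichlet(prior + counts[s, a], size=n_samples)
            # One-step Bellman target under each sampled model.
            targets = rewards[s, a] + gamma * (P @ V)
            # Replace the posterior mean with the alpha-quantile.
            Q[s, a] = np.quantile(targets, alpha)
    return Q


def quantile_schedule(t, alpha_start=0.1, alpha_end=0.9, horizon=1000):
    """Hypothetical linear schedule: pessimistic early, optimistic later."""
    frac = min(max(t / horizon, 0.0), 1.0)
    return alpha_start + (alpha_end - alpha_start) * frac


if __name__ == "__main__":
    V = np.array([0.0, 1.0])
    rewards = np.zeros((2, 1))
    few = np.ones((2, 1, 2))          # one observation per outcome
    many = 1000.0 * np.ones((2, 1, 2))  # same proportions, far more data
    for name, counts in [("few", few), ("many", many)]:
        lo = quantile_backup(counts, rewards, V, alpha=0.1)
        hi = quantile_backup(counts, rewards, V, alpha=0.9)
        print(name, "gap between 0.9 and 0.1 quantiles:", (hi - lo).max())
```

In this sketch, a lower-tail quantile yields a smaller backed-up value than an upper-tail quantile, and the gap between the two narrows as counts grow, which mirrors the abstract's statement that the magnitude of optimism or pessimism decreases as data accumulate.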