🤖 AI Summary
This work addresses the large implicit constants in the theoretical regret bound of UCBVI (Azar et al., 2017), which cause a significant gap between theory and empirical performance. We propose the first systematic improvement: restructuring the exploration bonus, reanalyzing error propagation via Bernstein-type concentration inequalities, and optimizing the optimistic confidence intervals for both reward and transition models. While preserving the optimal $O(sqrt{HSAT})$ regret bound, our approach substantially tightens the leading constant. Theoretical analysis demonstrates a marked reduction in the gap between the theoretical regret bound and actual performance. Empirical evaluation on standard MDP benchmarks shows 30–50% improvement in sample efficiency, faster convergence, enhanced stability, and consistent superiority over both the original UCBVI and the state-of-the-art MVP algorithm.
📝 Abstract
In this work, we provide a refined analysis of the UCBVI algorithm (Azar et al., 2017), improving both the bonus terms and the regret analysis. Additionally, we compare our version of UCBVI with both its original version and the state-of-the-art MVP algorithm. Our empirical validation demonstrates that improving the multiplicative constants in the bounds has significant positive effects on the empirical performance of the algorithms.