🤖 AI Summary
This work addresses the absence of tight Bayesian regret bounds for Gaussian Process Posterior Sampling Reinforcement Learning (GP-PSRL) in unbounded state spaces. The authors propose a novel theoretical framework that recursively applies the Borell–Tsirelson–Ibragimov–Sudakov inequality to show that, with high probability, the states visited by the agent remain confined to a ball of near-constant radius. By combining this concentration property with a chaining argument that controls the cumulative regret, they establish the first tight Bayesian regret bound for GP-PSRL in unbounded domains. Specifically, they derive a bound of $\widetilde{\mathcal{O}}(H^{3/2} \sqrt{\gamma_{T/H} T})$, where $H$ denotes the episode horizon, $T$ is the total number of time steps, and $\gamma_{T/H}$ is the maximum information gain, thereby overcoming the limitations of existing theoretical analyses.
📝 Abstract
We analyze the Bayesian regret of the Gaussian process posterior sampling reinforcement learning (GP-PSRL) algorithm. Posterior sampling is an effective heuristic for decision-making under uncertainty that has been used to develop successful algorithms for a variety of continuous control problems. However, theoretical work on GP-PSRL is limited. All known regret bounds either fail to achieve a tight dependence on a kernel-dependent quantity called the maximum information gain, or fail to properly account for the fact that the set of possible system states is unbounded. Through a recursive application of the Borell–Tsirelson–Ibragimov–Sudakov inequality, we show that, with high probability, the states actually visited by the algorithm are contained within a ball of near-constant radius. To obtain tight dependence on the maximum information gain, we use the chaining method to control the regret suffered by GP-PSRL. Our main result is a Bayesian regret bound of the order $\widetilde{\mathcal{O}}(H^{3/2}\sqrt{\gamma_{T/H} T})$, where $H$ is the horizon, $T$ is the number of time steps, and $\gamma_{T/H}$ is the maximum information gain. With this result, we resolve the limitations of prior theoretical work on PSRL, and provide the theoretical foundation and tools for analyzing PSRL in complex settings.
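The core heuristic the abstract refers to, posterior sampling (Thompson sampling) with a Gaussian process model, can be illustrated with a minimal sketch: maintain a GP posterior over an unknown function, draw one sample function from that posterior each round, and act greedily with respect to the sample. This is not the paper's GP-PSRL algorithm (which samples transition dynamics and plans over episodes); the RBF kernel, lengthscale, noise level, and 1-D reward function below are all illustrative assumptions.

```python
import numpy as np

def rbf_kernel(X1, X2, lengthscale=0.5):
    # Squared-exponential kernel matrix between two sets of 1-D points.
    d = X1[:, None] - X2[None, :]
    return np.exp(-0.5 * (d / lengthscale) ** 2)

def gp_posterior(X_obs, y_obs, X_query, noise=1e-2, lengthscale=0.5):
    # Standard GP regression posterior mean and covariance at query points.
    K = rbf_kernel(X_obs, X_obs, lengthscale) + noise * np.eye(len(X_obs))
    K_s = rbf_kernel(X_obs, X_query, lengthscale)
    K_ss = rbf_kernel(X_query, X_query, lengthscale)
    K_inv = np.linalg.inv(K)
    mean = K_s.T @ K_inv @ y_obs
    cov = K_ss - K_s.T @ K_inv @ K_s
    return mean, cov

def thompson_step(rng, X_obs, y_obs, X_query):
    # Posterior sampling: draw one function from the GP posterior
    # and act greedily with respect to that single sample.
    mean, cov = gp_posterior(X_obs, y_obs, X_query)
    sample = rng.multivariate_normal(mean, cov + 1e-8 * np.eye(len(X_query)))
    return int(np.argmax(sample))

rng = np.random.default_rng(0)
f = lambda x: np.sin(3 * x)             # unknown function (hypothetical)
X_query = np.linspace(0.0, 2.0, 50)     # discretized decision set
X_obs = np.array([0.1, 1.9])            # two initial observations
y_obs = f(X_obs) + 0.01 * rng.standard_normal(2)

for _ in range(30):                     # posterior-sampling loop
    i = thompson_step(rng, X_obs, y_obs, X_query)
    X_obs = np.append(X_obs, X_query[i])
    y_obs = np.append(y_obs, f(X_query[i]) + 0.01 * rng.standard_normal())

best = X_obs[np.argmax(f(X_obs))]       # best point queried so far
```

The appeal of the heuristic, as the abstract notes, is that randomizing over posterior samples trades off exploration and exploitation automatically; the paper's contribution is proving that this behavior yields a tight $\widetilde{\mathcal{O}}(H^{3/2}\sqrt{\gamma_{T/H} T})$ Bayesian regret bound even when the state space is unbounded.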