🤖 AI Summary
Balancing Bayesian regret control and statistical rigor in continual-interaction reinforcement learning remains challenging. Method: We propose Continuing Posterior Sampling for Reinforcement Learning (Continuing PSRL), the first formalization and rigorous analysis of a periodic posterior-resampling exploration mechanism, integrated with $\gamma$-discounted MDP modeling and Bayesian online learning. Contribution/Results: We introduce the reward averaging time $\tau$ to characterize environmental complexity and derive the first Bayesian regret upper bound applicable to continuing environments: $\tilde{O}(\tau S \sqrt{AT})$, where $S$, $A$, and $T$ denote the number of states, the number of actions, and the total number of time steps, respectively. This bound guarantees sublinear cumulative regret, substantially extending the applicability and theoretical foundations of posterior sampling to non-episodic, infinite-horizon settings.
📝 Abstract
We develop an extension of posterior sampling for reinforcement learning (PSRL) that is suited for a continuing agent-environment interface and integrates naturally into agent designs that scale to complex environments. The approach, continuing PSRL, maintains a statistically plausible model of the environment and follows a policy that maximizes expected $\gamma$-discounted return in that model. At each time, with probability $1-\gamma$, the model is replaced by a sample from the posterior distribution over environments. For a choice of discount factor that suitably depends on the horizon $T$, we establish an $\tilde{O}(\tau S \sqrt{AT})$ bound on the Bayesian regret, where $S$ is the number of environment states, $A$ is the number of actions, and $\tau$ denotes the reward averaging time, which is a bound on the duration required to accurately estimate the average reward of any policy. Our work is the first to formalize and rigorously analyze the resampling approach with randomized exploration.
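The mechanism described above can be illustrated with a minimal tabular sketch: maintain a Dirichlet posterior over transitions and a Beta posterior over Bernoulli rewards, act greedily with respect to the $\gamma$-discounted optimal policy of the currently sampled model, and resample the model with probability $1-\gamma$ at each step. All names here (`continuing_psrl`, `demo_step`, the posterior choices) are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def solve_discounted(P, R, gamma, iters=200):
    """Value iteration: greedy policy for the gamma-discounted MDP (P, R).
    P has shape (S, A, S); R has shape (S, A)."""
    V = np.zeros(R.shape[0])
    for _ in range(iters):
        Q = R + gamma * P @ V          # Q[s, a] = R[s, a] + gamma * sum_s' P[s,a,s'] V[s']
        V = Q.max(axis=1)
    return Q.argmax(axis=1)

def continuing_psrl(step, S, A, gamma=0.9, T=300, seed=0):
    """Minimal sketch of continuing PSRL (hypothetical API, tabular case).

    Posterior: Dirichlet counts over transitions, Beta counts over
    Bernoulli rewards. With probability 1 - gamma per step, resample an
    MDP from the posterior and replan the greedy discounted policy."""
    rng = np.random.default_rng(seed)
    trans = np.ones((S, A, S))          # Dirichlet pseudo-counts
    rew = np.ones((S, A, 2))            # Beta pseudo-counts (successes, failures)
    policy = rng.integers(0, A, size=S)
    s, total = 0, 0.0
    for _ in range(T):
        if rng.random() < 1.0 - gamma:  # geometric resampling schedule
            P = np.array([[rng.dirichlet(trans[i, a]) for a in range(A)]
                          for i in range(S)])
            R = rng.beta(rew[..., 0], rew[..., 1])
            policy = solve_discounted(P, R, gamma)
        a = policy[s]
        s2, r = step(s, a)              # interact with the environment
        trans[s, a, s2] += 1            # posterior update: transition count
        rew[s, a, 0] += r               # posterior update: reward counts
        rew[s, a, 1] += 1.0 - r
        total += r
        s = s2
    return total, policy

# Toy two-state chain (hypothetical): action 1 moves right, state 1 pays reward 1.
def demo_step(s, a):
    s2 = min(s + 1, 1) if a == 1 else 0
    return s2, (1.0 if s == 1 else 0.0)

total, policy = continuing_psrl(demo_step, S=2, A=2, gamma=0.9, T=300)
```

Note that the geometric resampling schedule replaces the fixed-episode resets of episodic PSRL: in expectation the sampled model is kept for $1/(1-\gamma)$ steps, which is how the discount factor doubles as an effective planning horizon in the continuing setting.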