No-Regret Thompson Sampling for Finite-Horizon Markov Decision Processes with Gaussian Processes

📅 2025-10-23
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses finite-horizon Markov decision processes (MDPs) with unknown rewards and state transitions that exhibit complex temporal dependencies. We establish the first no-regret guarantee for Thompson sampling (TS) under a joint Gaussian process (GP) prior over rewards and transitions. The core analytical challenges stem from the non-Gaussianity of value functions and the intractability of Bayesian updates under multi-step Bellman recursion. To overcome these, we: (1) propose a joint GP model for rewards and transitions, and design a TS algorithm tailored to multi-step optimization; (2) develop the first analytical pathway for deriving no-regret bounds for TS within GP-based Bellman recursion; and (3) extend the elliptical potential lemma to the multi-output setting, thereby resolving the non-Gaussian value function bottleneck. Our analysis yields a regret bound of $\tilde{O}(\sqrt{KH\Gamma(KH)})$, where $\Gamma$ quantifies GP complexity. This result provides a foundational theoretical guarantee for structured Bayesian reinforcement learning.
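To make the algorithmic loop concrete, here is a minimal Python sketch of GP-based Thompson sampling in a small tabular finite-horizon MDP. It is illustrative only: it places a GP prior on the reward alone and assumes the transition kernel is known, whereas the paper's algorithm uses a joint GP model over rewards and transitions. All names and parameters (`n_states`, `backward_induction`, the RBF kernel, the noise levels) are our assumptions, not the paper's.

```python
# Minimal, self-contained sketch of GP-based Thompson sampling in a small
# tabular finite-horizon MDP. ASSUMPTIONS (ours, not the paper's): the GP
# prior is on the reward only, the transition kernel P is known, and
# sklearn's GP regressor stands in for the paper's joint GP model.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

n_states, n_actions, H, K = 5, 3, 10, 50   # small MDP: K episodes of horizon H
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # known P[s, a, s']
true_reward = rng.uniform(0.0, 1.0, size=(n_states, n_actions))   # hidden from the agent

# One feature row per (state, action) pair; these are the GP inputs.
SA = np.array([[s, a] for s in range(n_states) for a in range(n_actions)], dtype=float)

def backward_induction(r):
    """Optimal nonstationary policy for the reward table r via finite-horizon DP."""
    V = np.zeros(n_states)
    pi = np.zeros((H, n_states), dtype=int)
    for h in reversed(range(H)):
        Q = r + P @ V                 # Q[s, a] = r[s, a] + sum_s' P[s, a, s'] * V[s']
        pi[h] = Q.argmax(axis=1)
        V = Q.max(axis=1)
    return pi

X_hist, y_hist = [], []
gp = GaussianProcessRegressor(kernel=RBF(length_scale=1.0), alpha=1e-2)

for k in range(K):
    # Thompson step: draw ONE plausible reward function from the posterior
    # (from the prior in the very first episode) and plan against it.
    if X_hist:
        gp.fit(np.array(X_hist), np.array(y_hist))
        r_sample = gp.sample_y(SA, random_state=k).reshape(n_states, n_actions)
    else:
        r_sample = rng.normal(0.5, 0.5, size=(n_states, n_actions))
    pi = backward_induction(r_sample)
    # Roll out the sampled-model-optimal policy for one episode.
    s = int(rng.integers(n_states))
    for h in range(H):
        a = pi[h, s]
        y = true_reward[s, a] + 0.1 * rng.normal()   # noisy reward feedback
        X_hist.append([s, a])
        y_hist.append(y)
        s = int(rng.choice(n_states, p=P[s, a]))     # step with the known kernel
```

The defining TS move is that exploration comes from planning against a single posterior sample per episode, so uncertainty in the GP posterior, not an explicit bonus, determines which state-action pairs get tried.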

📝 Abstract
Thompson sampling (TS) is a powerful and widely used strategy for sequential decision-making, with applications ranging from Bayesian optimization to reinforcement learning (RL). Despite its success, the theoretical foundations of TS remain limited, particularly in settings with complex temporal structure such as RL. We address this gap by establishing no-regret guarantees for TS using models with Gaussian marginal distributions. Specifically, we consider TS in episodic RL with joint Gaussian process (GP) priors over rewards and transitions. We prove a regret bound of $\tilde{\mathcal{O}}(\sqrt{KH\Gamma(KH)})$ over $K$ episodes of horizon $H$, where $\Gamma(\cdot)$ captures the complexity of the GP model. Our analysis addresses several challenges, including the non-Gaussian nature of value functions and the recursive structure of Bellman updates, and extends classical tools such as the elliptical potential lemma to multi-output settings. This work advances the understanding of TS in RL and highlights how structural assumptions and model uncertainty shape its performance in finite-horizon Markov Decision Processes.
Problem

Research questions and friction points this paper is trying to address.

Establishing no-regret guarantees for Thompson sampling in reinforcement learning
Analyzing Thompson sampling with Gaussian process priors over rewards and transitions
Extending classical theoretical tools to handle non-Gaussian value functions in MDPs (the Bellman recursion sketched below shows where the non-Gaussianity arises)
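A one-line illustration of the non-Gaussianity friction point, in standard notation rather than the paper's: even if the reward $r$ and the transition kernel $P$ follow a joint GP, the finite-horizon Bellman recursion

$$V_h(s) \;=\; \max_{a}\Big[\, r(s,a) + \mathbb{E}_{s' \sim P(\cdot \mid s,a)}\big[V_{h+1}(s')\big] \Big], \qquad V_{H+1} \equiv 0,$$

takes a maximum over actions at every step, and a maximum of (jointly) Gaussian variables is not Gaussian; iterating this over $H$ steps is what breaks the classical Gaussian analysis.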
Innovation

Methods, ideas, or system contributions that make the work stand out.

Thompson sampling with Gaussian process priors
Regret bound analysis for episodic reinforcement learning
Extension of the elliptical potential lemma to multi-output settings (a classical single-output form is recalled below)
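For orientation, a classical single-output form of the elliptical potential lemma (the paper's multi-output extension is not reproduced here) reads as follows: for $x_1, \dots, x_T \in \mathbb{R}^d$ with $\|x_t\|_2 \le L$ and $V_t = \lambda I + \sum_{s=1}^{t} x_s x_s^\top$,

$$\sum_{t=1}^{T} \min\!\Big\{1,\; \|x_t\|^2_{V_{t-1}^{-1}}\Big\} \;\le\; 2 \log \frac{\det V_T}{\det(\lambda I)} \;\le\; 2 d \log\!\Big(1 + \frac{T L^2}{d \lambda}\Big).$$

The multi-output extension claimed above lifts this potential argument to vector-valued observations, which is what allows the regret analysis to track uncertainty through the $H$-step Bellman recursion.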