🤖 AI Summary
This work addresses the challenge of implementing Thompson sampling efficiently in contextual bandits, where a policy must balance exploration and exploitation. The authors propose PFN-TS, an algorithm that uses Prior-data Fitted Networks (PFNs) to approximate Bayesian posteriors. PFNs predict noisy future rewards, whereas Thompson sampling needs uncertainty over the latent mean reward, so PFN-TS converts PFN posterior predictives into mean-reward samples using a subsampled predictive central limit theorem. Posterior variance is estimated from a geometric grid of $O(\log n)$ dataset prefixes rather than the full $O(n)$ predictive sequence, and efficiency is further improved by reusing TabICL's cached representations across rounds. Empirically, PFN-TS achieves the best average rank on nonlinear synthetic and OpenML classification-to-bandit benchmarks, remains competitive on linear and BART-generated rewards, and attains the highest estimated policy value in an offline mobile-health evaluation.
📝 Abstract
Thompson sampling is a widely used strategy for contextual bandits: at each round, it samples a reward function from a Bayesian posterior and acts greedily under that sample. Prior-data fitted networks (PFNs), such as TabPFN v2+ and TabICL v2, are attractive candidates for this purpose because they approximate Bayesian posterior predictive distributions in a single forward pass. However, PFNs predict noisy future rewards, while Thompson sampling requires uncertainty over the latent mean reward function. We propose PFN-TS, a Thompson sampling algorithm that converts PFN posterior predictives into mean-reward samples using a subsampled predictive central limit theorem. The method estimates posterior variance from a geometric grid of $O(\log n)$ dataset prefixes rather than the full $O(n)$ predictive sequence used in previous predictive-sequence approaches, and reuses TabICL's cached representations across rounds. We prove consistency of the subsampled variance estimator and give a Bayesian regret bound that decomposes PFN-TS regret into exact posterior-sampling regret under the PFN prior plus approximation terms. Empirically, PFN-TS achieves the best average rank across nonlinear synthetic and OpenML classification-to-bandit benchmarks, remains competitive on linear and BART-generated rewards, and attains the highest estimated policy value in an offline mobile-health evaluation. Code is available at https://anonymous.4open.science/r/PFN_TS-36ED/.
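To make the two ideas in the abstract concrete, here is a minimal sketch, not the authors' implementation. It shows (a) how a geometric grid yields only $O(\log n)$ prefix lengths for a dataset of size $n$, and (b) a single Thompson sampling step that draws one mean-reward sample per arm and acts greedily. The function names `geometric_prefix_sizes` and `thompson_step`, the growth factor of 2, and the Gaussian draw from per-arm mean and variance estimates are all assumptions made for illustration; the paper's actual grid and variance estimator may differ.

```python
import math
import random


def geometric_prefix_sizes(n: int, base: float = 2.0) -> list[int]:
    """Return increasing prefix lengths 1, base, base^2, ..., capped at n.

    Hypothetical helper: the growth factor `base` is an illustrative choice.
    The list has O(log n) entries instead of the n entries of a full sequence.
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    if base <= 1.0:
        raise ValueError("base must be greater than 1")
    sizes: list[int] = []
    m = 1.0
    while m < n:
        size = int(m)
        # Skip duplicates that can arise from rounding when base is small.
        if not sizes or size > sizes[-1]:
            sizes.append(size)
        m *= base
    sizes.append(n)  # always include the full dataset as the last prefix
    return sizes


def thompson_step(means, variances, rng: random.Random) -> int:
    """Draw one mean-reward sample per arm and return the greedy arm index.

    Assumes a Gaussian approximation to each arm's mean-reward posterior,
    given externally supplied mean and variance estimates (hypothetical inputs).
    """
    samples = [rng.gauss(m, math.sqrt(v)) for m, v in zip(means, variances)]
    return max(range(len(samples)), key=samples.__getitem__)


if __name__ == "__main__":
    grid = geometric_prefix_sizes(1000)
    print(len(grid), grid)
    rng = random.Random(0)
    print(thompson_step([0.1, 0.9, 0.5], [0.04, 0.04, 0.04], rng))
```

With `n = 1000` and a growth factor of 2, the grid contains 11 prefix lengths rather than 1000, which is the kind of saving the $O(\log n)$ claim refers to. When all variances are zero, `thompson_step` reduces to picking the arm with the largest mean estimate.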