🤖 AI Summary
This work addresses the challenge of enabling models pretrained on synthetic data to effectively solve empirical Bayes problems under arbitrary, unknown test distributions. To this end, it introduces the notion of a “universal prior”: pretraining a Transformer under such a prior yields adaptive inference across diverse test distributions. In the Poisson empirical Bayes setting, the method is theoretically shown to attain a near-optimal regret bound of $\widetilde{O}(1/n)$ uniformly over all test distributions. Furthermore, the paper explains the model’s ability to generalize beyond the training sequence length through the lens of posterior contraction and Bayesian inference, offering principled insight into its robust out-of-distribution performance.
📝 Abstract
We theoretically justify the recent empirical finding of [Teh et al., 2025] that a transformer pretrained on synthetically generated data achieves strong performance on empirical Bayes (EB) problems. We take an indirect approach to this question: rather than analyzing the model architecture or training dynamics, we ask why a Bayes estimator pretrained under a prespecified training distribution can adapt to arbitrary test distributions. Focusing on Poisson EB problems, we identify universal priors such that training under these priors yields a near-optimal regret bound of $\widetilde{O}(1/n)$ uniformly over all test distributions. Our analysis leverages the classical phenomenon of posterior contraction in Bayesian statistics, showing that the pretrained transformer adapts to unknown test distributions precisely through posterior contraction. This perspective also explains length generalization, in which the test sequence length exceeds the training length: the model performs Bayesian inference using a generalized posterior.
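As background for readers unfamiliar with the Poisson EB setting the abstract refers to: given counts $X_i \sim \mathrm{Poisson}(\theta_i)$ with the $\theta_i$ drawn from an unknown prior, one wants to estimate each $\theta_i$ from the pooled data. The sketch below implements the classical Robbins (1956) estimator, which uses the identity $\mathbb{E}[\theta \mid X = x] = (x+1)\,p(x+1)/p(x)$ and plugs in empirical frequencies. This is a standard illustration of the problem class, not the paper's transformer-based method; the sample data and fallback rule are our own illustrative choices.

```python
from collections import Counter

def robbins_estimator(xs):
    """Classical Robbins estimator for the Poisson empirical Bayes problem.

    Estimates the posterior mean E[theta | X = x] by
    (x + 1) * N(x + 1) / N(x), where N(k) is the number of
    observations in `xs` equal to k.
    """
    counts = Counter(xs)

    def estimate(x):
        if counts[x] == 0:
            # Fallback when x was never observed (an illustrative choice,
            # not part of the original estimator).
            return float(x)
        return (x + 1) * counts[x + 1] / counts[x]

    return estimate

# Toy data: counts that might arise from Poisson draws (illustrative only).
xs = [2, 3, 3, 4, 2, 5, 3, 1, 4, 3]
est = robbins_estimator(xs)
# N(3) = 4 and N(4) = 2, so the estimate at x = 3 is (3+1) * 2/4 = 2.0.
print(est(3))
```

The paper's regret bound concerns how fast such per-distribution estimates approach the Bayes-optimal rule as the sequence length $n$ grows, uniformly over the unknown test prior.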