🤖 AI Summary
This paper systematically investigates the performance trade-offs between Bayesian (distributional) and frequentist (point) inference in amortized in-context learning. Within a unified experimental framework, it compares maximum likelihood / maximum a posteriori estimation (MLE/MAP), diagonal Gaussian variational approximation, normalizing flows, and score-based diffusion samplers on tasks ranging from linear models to shallow neural networks, under both in-distribution and out-of-distribution generalization settings. The study finds that point estimators generally outperform distributional estimators, especially in high-dimensional regimes; the latter remain competitive only in low-dimensional settings. This finding points to an implicit "simplicity benefits" principle in amortized in-context inference. By providing an empirically grounded, like-for-like assessment of the Bayesian and frequentist paradigms in this amortized regime, the paper offers both evidence and practical guidance for model selection in in-context learning.
📝 Abstract
Bayesian and frequentist inference are two fundamental paradigms in statistical estimation. Bayesian methods treat hypotheses as random variables, incorporating priors and updating beliefs via Bayes' theorem, whereas frequentist methods assume fixed but unknown hypotheses, relying on estimators such as maximum likelihood. While extensive research has compared these approaches, the frequentist paradigm of obtaining point estimates has become predominant in deep learning, as Bayesian inference is challenging due to the computational cost and the approximation gap of posterior estimation methods. However, a good understanding of the trade-offs between the two approaches is lacking in the regime of amortized estimators, where in-context learners are trained to estimate either point values via maximum likelihood or maximum a posteriori estimation, or full posteriors using normalizing flows, score-based diffusion samplers, or diagonal Gaussian approximations, conditioned on observations. To help resolve this, we conduct a rigorous comparative analysis spanning diverse problem settings, from linear models to shallow neural networks, with a robust evaluation framework assessing both in-distribution and out-of-distribution generalization on tractable tasks. Our experiments indicate that amortized point estimators generally outperform posterior inference, though the latter remains competitive in some low-dimensional problems, and we further discuss why this might be the case.
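To make the contrast between the two training regimes concrete, here is a minimal sketch (not the paper's implementation) of the two per-example objectives: a regression-style loss for an amortized point estimator, and the negative log-density of the true parameter under a predicted diagonal Gaussian posterior. The helper names `point_loss` and `diag_gaussian_nll` are illustrative choices, not names from the paper.

```python
import numpy as np

# Hedged sketch: an amortized point estimator would output w_hat and be
# trained with a regression loss against the data-generating parameter w_true;
# a diagonal Gaussian posterior network would output (mu, log_sigma) and be
# trained to maximize the density it assigns to w_true.

def point_loss(w_hat, w_true):
    """Amortized point estimation: mean squared error to the true parameter."""
    return float(np.mean((w_hat - w_true) ** 2))

def diag_gaussian_nll(mu, log_sigma, w_true):
    """Amortized distributional estimation: mean negative log-density of
    w_true under the predicted Gaussian N(mu, diag(exp(2 * log_sigma)))."""
    var = np.exp(2.0 * log_sigma)
    return float(np.mean(
        0.5 * ((w_true - mu) ** 2 / var + 2.0 * log_sigma + np.log(2.0 * np.pi))
    ))

# A perfect point estimate drives the point loss to zero ...
w = np.array([0.3, -1.2])
print(point_loss(w, w))  # 0.0
# ... while the Gaussian NLL at mu = w_true, sigma = 1 is 0.5 * log(2*pi)
# per dimension (about 0.9189), the entropy-like floor a distributional
# head must pay even when its mean is exact.
print(diag_gaussian_nll(w, np.zeros_like(w), w))
```

Both objectives condition on the same in-context observations in practice; the difference is only in the output head and loss, which is what the paper's comparison isolates.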