🤖 AI Summary
Variational Bayesian (VB) inference traditionally relies on conjugate priors or analytic approximations, limiting its scalability and applicability to modern large-scale generative models.
Method: Leveraging information geometry, we establish a fundamental equivalence: under exponential-family variational distributions, VB optimization corresponds to natural-gradient ascent, with the objective interpretable as a quadratic surrogate of the KL divergence. Building on this, we propose the Natural-Gradient Bayesian Learning Rule—a unified framework for posterior updates via natural-gradient accumulation—bypassing conjugacy requirements. We further extend it to large-scale settings, designing an efficient variational inference algorithm tailored for foundation models.
Contribution/Results: Our work provides a geometric foundation for Bayesian learning, unifying posterior inference through natural gradients. Empirically, the method significantly accelerates training convergence and improves inference accuracy in large language models, advancing scalable variational inference for state-of-the-art generative modeling.
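To make the natural-gradient update concrete, here is a minimal sketch on a toy conjugate model (the model, notation, and learning rate are illustrative assumptions, not taken from the paper): a Gaussian variational posterior is updated by mixing its natural parameters with the gradient of the expected log-joint taken with respect to the expectation parameters, which is the natural-gradient ascent step the summary describes.

```python
import numpy as np

# Toy conjugate model (an illustrative assumption, not from the paper):
# likelihood y_i ~ N(theta, 1), prior theta ~ N(0, 1).
rng = np.random.default_rng(0)
y = rng.normal(1.5, 1.0, size=20)
N, S = len(y), y.sum()

# Exact posterior for reference: precision 1 + N, mean S / (1 + N).
exact_prec, exact_mean = 1.0 + N, S / (1.0 + N)

# Variational posterior q = N(m, v) with natural parameters
# lam1 = m / v and lam2 = -1 / (2 v).
lam1, lam2 = 0.0, -0.5  # start at the prior N(0, 1)

# Because the model is conjugate, E_q[log p(y, theta)] is linear in the
# expectation parameters mu = (E[theta], E[theta^2]), so its mu-gradient
# is the constant vector (S, -(1 + N) / 2).  The natural-gradient step is
# then a convex combination of the current natural parameters and that
# gradient; with learning rate 1 it recovers Bayes' rule in one step.
rho = 0.3  # learning rate (illustrative choice)
for _ in range(100):
    lam1 = (1 - rho) * lam1 + rho * S
    lam2 = (1 - rho) * lam2 + rho * (-(1.0 + N) / 2.0)

v = -1.0 / (2.0 * lam2)  # back to mean/variance parameterization
m = lam1 * v
print(m, 1.0 / v)  # converges to the exact posterior mean and precision
```

In non-conjugate models the same update applies, but the gradient of the expected log-joint must be estimated (e.g., by Monte Carlo), which is where the scalability questions addressed by the paper arise.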
📝 Abstract
We highlight a fundamental connection between information geometry and variational Bayes (VB) and discuss its consequences for machine learning. Under certain conditions, a VB solution always requires estimation or computation of natural gradients. We show several consequences of this fact by using the natural-gradient descent algorithm of Khan and Rue (2023), called the Bayesian Learning Rule (BLR). These include (i) a simplification of Bayes' rule as addition of natural gradients, (ii) a generalization of quadratic surrogates used in gradient-based methods, and (iii) a large-scale implementation of VB algorithms for large language models. Neither the connection nor its consequences are new, but we further emphasize the common origins of the two fields of information geometry and Bayes in the hope of facilitating more work at the intersection of the two fields.
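As a sketch of consequence (i), in notation assumed here rather than quoted from the paper: for an exponential-family variational distribution $q_\lambda$ with natural parameters $\lambda$ and expectation parameters $\mu$, the BLR performs natural-gradient ascent on the evidence lower bound, which takes the form

```latex
\lambda_{t+1} \;=\; (1-\rho_t)\,\lambda_t \;+\; \rho_t\,\nabla_{\mu}\,
\mathbb{E}_{q_{\lambda_t}}\!\bigl[\log p(\mathcal{D}, \theta)\bigr].
```

In a conjugate model the gradient term is constant, so a single step with $\rho_t = 1$ reduces to Bayes' rule written additively in natural-parameter space: the posterior's natural parameters are the prior's plus the likelihood's contribution.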