🤖 AI Summary
Neural network loss landscapes are inherently degenerate, and the global convergence guarantees behind standard stochastic gradient MCMC (SGMCMC) algorithms rest on assumptions that are likely incompatible with this degeneracy. The authors therefore argue for a shift in focus from global to local posterior sampling, and introduce a novel scalable benchmark for evaluating how faithfully SGMCMC algorithms capture the local geometry of the posterior. Comparing several common algorithms on this benchmark, they find that RMSProp-preconditioned stochastic gradient Langevin dynamics (SGLD) best represents local posterior geometry. Although global convergence guarantees remain out of reach, experiments show that non-trivial local statistical structure can be extracted from models with up to O(100M) parameters, providing a practical baseline for evaluating SGMCMC in deep learning.
📝 Abstract
Degeneracy is an inherent feature of the loss landscape of neural networks, but it is not well understood how stochastic gradient MCMC (SGMCMC) algorithms interact with this degeneracy. In particular, current global convergence guarantees for common SGMCMC algorithms rely on assumptions which are likely incompatible with degenerate loss landscapes. In this paper, we argue that this gap requires a shift in focus from global to local posterior sampling, and, as a first step, we introduce a novel scalable benchmark for evaluating the local sampling performance of SGMCMC algorithms. We evaluate a number of common algorithms, and find that RMSProp-preconditioned SGLD is most effective at faithfully representing the local geometry of the posterior distribution. Although we lack theoretical guarantees about global sampler convergence, our empirical results show that we are able to extract non-trivial local information in models with up to O(100M) parameters.