🤖 AI Summary
This work addresses the challenge of predicting generalization in deep nonlinear Bayesian neural networks in the proportional regime, where training set size and network width grow at the same rate. The authors propose an equivalent Wishart Ansatz to characterize the dominant random fluctuations of the layerwise empirical kernels in multilayer perceptrons (MLPs), and combine it with large deviation theory to derive a partition function expressed in terms of a renormalized Neural Network Gaussian Process (NNGP) kernel. This approximate approach extends non-perturbative results, previously available mainly for shallow single-hidden-layer networks, to deep Bayesian MLPs of fixed depth L, yielding a kernel renormalization framework governed by at most L scalar order parameters that are determined self-consistently. An extension to CNNs identifies a hierarchical local kernel renormalization mechanism that quantifies more complex data-dependent transformations of the large-width kernel at finite width. The theory is tested against posterior sampling of Bayesian networks with depths of order 10 and training sets of order 10³ examples on classic benchmark datasets, showing overall very good agreement together with two distinct types of systematic deviations.
📝 Abstract
The scaling limit where both the size of the training set $P$ and the width $N$ of a deep neural network grow at the same rate, the so-called proportional-width regime, has been intensely studied for shallow, single-hidden-layer networks. However, extending these non-perturbative results from shallow architectures to deep non-linear networks has proven very challenging. Here we present an effective approximate approach to predict the generalization performance of Bayesian multi-layer perceptrons (MLPs) of fixed depth $L$ on arbitrary high-dimensional data. We propose an equivalent Wishart Ansatz to capture the dominant stochastic fluctuations of the hierarchical empirical kernels of MLPs. This allows us to perform a large deviation analysis for the partition function of MLPs in the proportional limit, expressed in terms of a renormalized NNGP kernel. In this description, even strong representation learning in the proportional limit is encoded in at most $L$ scalar order parameters, determined self-consistently. Extending the approach to convolutional architectures (CNNs), we identify a hierarchical local kernel renormalization mechanism, which allows us to quantify more complex data-dependent transformations of the large-width kernel in CNNs due to finite-width effects. We test our effective theory against sampling experiments from the Bayesian posterior of finite deep neural networks with depths $L \sim O(10)$ and training set sizes $P \sim O(10^3)$ on classic benchmark datasets, finding overall very good agreement together with two distinct types of systematic deviations.
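For orientation, the large-width NNGP kernel that the abstract takes as its starting point can be sketched with the standard layerwise recursion. This is a minimal illustration, not the authors' renormalized theory: it assumes ReLU activations, no biases, weight variance `sigma_w2 = 2`, and an input kernel normalized by the input dimension, and the function name `relu_nngp_kernel` is hypothetical. The self-consistent equations for the scalar order parameters that renormalize this kernel at finite width are not given in the abstract and are not reproduced here.

```python
import numpy as np


def relu_nngp_kernel(X, depth, sigma_w2=2.0):
    """Infinite-width NNGP kernel of a bias-free ReLU MLP (illustrative sketch).

    X        : array of shape (P, d), one input per row (rows must be nonzero).
    depth    : number of hidden layers the recursion is iterated over.
    sigma_w2 : weight variance (assumed value; 2.0 preserves the diagonal for ReLU).
    """
    # Input-layer kernel, shape (P, P).
    K = X @ X.T / X.shape[1]
    for _ in range(depth):
        # Geometric mean of the diagonal entries, sqrt(K_ii * K_jj).
        std = np.sqrt(np.diag(K))
        norm = np.outer(std, std)
        # Cosine of the angle between pre-activations; clip guards against round-off.
        cos_t = np.clip(K / norm, -1.0, 1.0)
        theta = np.arccos(cos_t)
        # Closed-form Gaussian expectation E[relu(u) relu(v)] (arc-cosine kernel).
        K = sigma_w2 * norm * (np.sin(theta) + (np.pi - theta) * cos_t) / (2.0 * np.pi)
    return K


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.standard_normal((6, 20))      # P = 6 toy inputs of dimension d = 20
    K = relu_nngp_kernel(X, depth=10)
    print(K.shape)
```

In the infinite-width limit this kernel is fixed by the architecture alone; the point of the abstract is that, in the proportional regime, the kernel entering the predictor becomes data-dependent through a small number of scalar order parameters.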