🤖 AI Summary
This work challenges the conventional wisdom that “wider networks generalize better” by investigating the learning dynamics of Bayesian Parallel-Branch Graph Neural Networks (BPB-GNNs) in the narrow-width regime, where the width is asymptotically smaller than the number of training samples.
Method: Integrating Bayesian deep learning, kernel methods, and symmetry-breaking theory, we establish the first analytical framework for parallel-branch networks in the narrow-width limit. Our analysis reveals a kernel-renormalization-induced symmetry breaking across branches, which decouples the readout norm from architectural hyperparameters and renders it dependent solely on the intrinsic structure of the data.
Results: We demonstrate, both theoretically and empirically, that narrow-width BPB-GNNs achieve test accuracy comparable to or exceeding that of their wide-width counterparts in bias-limited settings, while exhibiting enhanced robustness. Crucially, this narrow-width effect is an architectural hallmark of parallel-branch GNNs, validated across multiple graph learning benchmarks.
📝 Abstract
The infinite-width limit of random neural networks is known to yield a Neural Network Gaussian Process (NNGP) (Lee et al. [2018]), characterized by task-independent kernels. It is widely accepted that larger network widths contribute to improved generalization (Park et al. [2019]). However, this work challenges this notion by investigating the narrow-width limit of the Bayesian Parallel Branching Graph Neural Network (BPB-GNN), an architecture that resembles residual networks. We demonstrate that when the width of a BPB-GNN is significantly smaller than the number of training examples, each branch exhibits more robust learning due to a symmetry breaking of branches in kernel renormalization. Surprisingly, the performance of a BPB-GNN in the narrow-width limit is generally superior or comparable to that achieved in the wide-width limit in bias-limited scenarios. Furthermore, the readout norms of each branch in the narrow-width limit are largely independent of the architectural hyperparameters but generally reflective of the nature of the data. Our results characterize a newly defined narrow-width regime for parallel branching networks in general.
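The abstract does not spell out the BPB-GNN forward pass, so the following is only a minimal illustrative sketch of the parallel-branch idea it describes: each branch reads out features propagated a different number of hops through the graph via its own weights and readout vector, and the branch outputs are summed, resembling a residual expansion. All names here (`bpb_gnn_forward`, `Ws`, `readouts`) and the specific k-hop branch construction are assumptions for illustration, not the paper's exact model.

```python
import numpy as np

def bpb_gnn_forward(A, X, Ws, readouts):
    """Illustrative parallel-branch GNN forward pass (hypothetical).

    A        : (n, n) graph adjacency/propagation matrix
    X        : (n, d) node features
    Ws       : list of (d, width) hidden weights, one per branch
    readouts : list of (width,) readout vectors, one per branch

    Branch k sees the k-hop-propagated features A^k X; its output is
    scaled by 1/sqrt(width), and all branch outputs are summed.
    """
    n = A.shape[0]
    out = np.zeros(n)
    prop = X  # k = 0: raw features, a skip-connection-like branch
    for W, a in zip(Ws, readouts):
        out = out + (prop @ W) @ a / np.sqrt(W.shape[1])
        prop = A @ prop  # one more hop for the next branch
    return out
```

Because each branch keeps its own readout vector, the per-branch readout norms discussed in the abstract are well-defined quantities in this sketch.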