🤖 AI Summary
This work studies how depth affects generalization in deep neural networks (DNNs). It derives layer-wise information-theoretic generalization bounds that measure the divergence between the training and test distributions of hidden-layer representations, using either the KL divergence or the 1-Wasserstein distance. The KL-based bound shrinks as the layer index increases, while the Wasserstein-based bound implies the existence of a “generalization funnel” layer at which the 1-Wasserstein distance is minimal. Using the strong data processing inequality (SDPI), the authors quantify how much the relevant information measures contract between consecutive layers under Dropout, DropConnect, and Gaussian noise injection. Analytic expressions for both bounds are obtained for binary Gaussian classification with linear DNNs, and specializing to a finite parameter space with the Gibbs algorithm suggests that deeper yet narrower architectures generalize better in those examples, though how broadly this holds remains open.
📝 Abstract
Deep neural networks (DNNs) exhibit an exceptional capacity for generalization in practical applications. This work aims to capture the effect and benefits of depth for supervised learning via information-theoretic generalization bounds. We first derive two hierarchical bounds on the generalization error in terms of the Kullback-Leibler (KL) divergence or the 1-Wasserstein distance between the train and test distributions of the network internal representations. The KL divergence bound shrinks as the layer index increases, while the Wasserstein bound implies the existence of a layer that serves as a generalization funnel, which attains a minimal 1-Wasserstein distance. Analytic expressions for both bounds are derived under the setting of binary Gaussian classification with linear DNNs. To quantify the contraction of the relevant information measures when moving deeper into the network, we analyze the strong data processing inequality (SDPI) coefficient between consecutive layers of three regularized DNN models: $\mathsf{Dropout}$, $\mathsf{DropConnect}$, and Gaussian noise injection. This enables refining our generalization bounds to capture the contraction as a function of the network architecture parameters. Specializing our results to DNNs with a finite parameter space and the Gibbs algorithm reveals that deeper yet narrower network architectures generalize better in those examples, although how broadly this statement applies remains a question.
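The contraction idea behind the SDPI can be illustrated with a toy numerical sketch. This is not the paper's model or its bound. It assumes discrete distributions and replaces each stochastic layer with a simple erasure channel (a loose stand-in for dropping a unit with probability `delta`), for which the KL divergence contracts by exactly the factor `1 - delta` per pass. The names `kl_divergence`, `erasure_channel`, `p_train`, `q_test`, and `delta` are hypothetical and chosen only for this illustration.

```python
import numpy as np

def kl_divergence(p, q):
    """KL divergence D(p || q) in nats; assumes q > 0 wherever p > 0."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    mask = p > 0  # terms with p = 0 contribute nothing
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

def erasure_channel(p, delta):
    """Keep each symbol with probability 1 - delta, otherwise emit a new 'erased' symbol."""
    p = np.asarray(p, dtype=float)
    return np.append((1.0 - delta) * p, delta)

# Hypothetical "train" and "test" distributions over three symbols (assumption).
p_train = np.array([0.7, 0.2, 0.1])
q_test = np.array([0.4, 0.4, 0.2])
delta = 0.5  # erasure probability per layer (assumption)

# Track the KL divergence as both distributions pass through the same noisy layers.
kl_per_layer = []
p, q = p_train, q_test
for layer in range(4):
    kl_per_layer.append(kl_divergence(p, q))
    p, q = erasure_channel(p, delta), erasure_channel(q, delta)

for layer, value in enumerate(kl_per_layer):
    print(f"layer {layer}: KL = {value:.6f}")
```

In this toy setting the divergence after `k` layers equals `(1 - delta) ** k` times the initial divergence, which mirrors, in a much simpler form, the statement that the KL-based bound shrinks with the layer index. The SDPI coefficients for the three regularized models analyzed in the paper depend on the architecture and are not reproduced here.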