Information-Theoretic Generalization Bounds for Deep Neural Networks

📅 2024-04-04
🏛️ arXiv.org
📈 Citations: 4
Influential: 1
🤖 AI Summary
This work addresses the fundamental “depth effect” on generalization in deep neural networks (DNNs). We derive the first layer-wise information-theoretic generalization bounds, quantifying the divergence between the training and test distributions of the hidden-layer representations via the KL divergence and the 1-Wasserstein distance. Our analysis reveals that depth promotes generalization through progressive information compression. We introduce the concept of a “generalization funnel” layer and prove that the KL-based bound contracts monotonically as the layer index increases. Leveraging the strong data processing inequality (SDPI), we rigorously quantify the information compression induced by Dropout, DropConnect, and Gaussian noise injection. Under linear DNNs and a finite parameter space, we obtain closed-form analytical bounds, validating that “deeper and narrower” architectures enhance generalization and providing a rigorous information-theoretic foundation for the generalization advantage of depth.
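The layer-wise KL contraction can be illustrated numerically. The sketch below is not from the paper: modeling the train and test representation distributions as Gaussians, and the particular layer width and noise variance, are assumptions for illustration. It pushes two Gaussians through random linear layers with Gaussian noise injection and evaluates the closed-form KL divergence after each layer; by the (strong) data processing inequality the sequence cannot increase.

```python
import numpy as np

def kl_gauss(mu1, S1, mu2, S2):
    """Closed-form KL( N(mu1, S1) || N(mu2, S2) )."""
    k = len(mu1)
    S2inv = np.linalg.inv(S2)
    d = mu2 - mu1
    return 0.5 * (np.trace(S2inv @ S1) + d @ S2inv @ d - k
                  + np.log(np.linalg.det(S2) / np.linalg.det(S1)))

rng = np.random.default_rng(0)
dim, depth, sigma2 = 4, 5, 0.5  # layer width, depth, injected noise variance

# "Train" and "test" input representations, modeled as Gaussians.
mu_p, mu_q = np.zeros(dim), np.ones(dim)
S_p, S_q = np.eye(dim), np.eye(dim)

kls = [kl_gauss(mu_p, S_p, mu_q, S_q)]
for _ in range(depth):
    W = rng.normal(size=(dim, dim)) / np.sqrt(dim)  # random linear layer
    # Push both Gaussians through the channel x -> W x + N(0, sigma2 * I).
    mu_p, mu_q = W @ mu_p, W @ mu_q
    S_p = W @ S_p @ W.T + sigma2 * np.eye(dim)
    S_q = W @ S_q @ W.T + sigma2 * np.eye(dim)
    kls.append(kl_gauss(mu_p, S_p, mu_q, S_q))

print(kls)  # non-increasing across layers, by the data processing inequality
```

The same linear-plus-noise channel is applied to both distributions, which is exactly the setting in which the (S)DPI forces the divergence between them to shrink with depth.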

📝 Abstract
Deep neural networks (DNNs) exhibit an exceptional capacity for generalization in practical applications. This work aims to capture the effect and benefits of depth for supervised learning via information-theoretic generalization bounds. We first derive two hierarchical bounds on the generalization error in terms of the Kullback-Leibler (KL) divergence or the 1-Wasserstein distance between the train and test distributions of the network internal representations. The KL divergence bound shrinks as the layer index increases, while the Wasserstein bound implies the existence of a layer that serves as a generalization funnel, which attains a minimal 1-Wasserstein distance. Analytic expressions for both bounds are derived under the setting of binary Gaussian classification with linear DNNs. To quantify the contraction of the relevant information measures when moving deeper into the network, we analyze the strong data processing inequality (SDPI) coefficient between consecutive layers of three regularized DNN models: $\mathsf{Dropout}$, $\mathsf{DropConnect}$, and Gaussian noise injection. This enables refining our generalization bounds to capture the contraction as a function of the network architecture parameters. Specializing our results to DNNs with a finite parameter space and the Gibbs algorithm reveals that deeper yet narrower network architectures generalize better in those examples, although how broadly this statement applies remains a question.
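The “generalization funnel” can be made concrete with a small empirical sketch. Everything below is a hypothetical illustration, not the paper's construction: train and test samples are pushed through a random ReLU network, a one-dimensional empirical 1-Wasserstein distance between activations serves as a crude proxy for representation divergence, and the funnel is taken to be the layer attaining the minimum.

```python
import numpy as np

def w1_empirical(a, b):
    """Empirical 1-Wasserstein distance between two equal-size 1-D samples:
    the mean absolute difference of the sorted values."""
    return np.mean(np.abs(np.sort(a) - np.sort(b)))

rng = np.random.default_rng(1)
n, dim, depth = 500, 8, 6

# Hypothetical "train" and slightly shifted "test" inputs.
x_train = rng.normal(0.0, 1.0, size=(n, dim))
x_test = rng.normal(0.3, 1.0, size=(n, dim))

dists = []
for _ in range(depth):
    W = rng.normal(size=(dim, dim)) / np.sqrt(dim)
    x_train = np.maximum(x_train @ W.T, 0)  # ReLU layer
    x_test = np.maximum(x_test @ W.T, 0)
    # Proxy: W1 distance between the first activation coordinates.
    dists.append(w1_empirical(x_train[:, 0], x_test[:, 0]))

funnel = int(np.argmin(dists))  # layer attaining the minimal W1 distance
print(funnel, dists)
```

In the paper's Wasserstein bound, it is this minimizing layer that controls the generalization error, which is why its existence matters.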
Problem

Research questions and friction points this paper is trying to address.

Study depth's impact on DNN generalization via information theory
Derive hierarchical bounds using KL divergence and Wasserstein distance
Analyze generalization in regularized DNNs with finite parameters
Innovation

Methods, ideas, or system contributions that make the work stand out.

Derives hierarchical bounds via KL divergence and 1-Wasserstein distance
Analyzes the SDPI coefficient for three regularized DNN models
Links deeper, narrower architectures to better generalization
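For reference, the three regularizers the paper analyzes as noisy channels between consecutive layers can be written in a few lines. The forms below are the standard (inverted-scaling) definitions, not code from the paper, and the rates and dimensions are illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)

def dropout(x, p=0.5):
    """Zero each activation independently with prob. p, rescaled by 1/(1-p)."""
    mask = rng.random(x.shape) >= p
    return x * mask / (1.0 - p)

def dropconnect(W, x, p=0.5):
    """Zero each weight independently with prob. p, then apply the layer."""
    mask = rng.random(W.shape) >= p
    return (W * mask / (1.0 - p)) @ x

def gaussian_noise(x, sigma=0.1):
    """Add i.i.d. Gaussian noise to each activation."""
    return x + rng.normal(0.0, sigma, size=x.shape)

x = rng.normal(size=5)
W = rng.normal(size=(5, 5))
print(dropout(x), dropconnect(W, x), gaussian_noise(x))
```

Each map is stochastic, which is what makes the layer-to-layer channel strictly noisy and gives the SDPI coefficient its contraction below one in the paper's analysis.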