Analytic theory of dropout regularization

📅 2025-05-12
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work investigates the theoretical mechanisms and optimal configuration of dropout regularization in two-layer neural networks. We develop a high-dimensional analytical theory of training under online stochastic gradient descent (SGD), using mean-field approximations and techniques from statistical physics to reduce the stochastic training dynamics to a closed system of ordinary differential equations in the high-dimensional limit. From these equations we obtain exact expressions for the generalization error and the optimal dropout probability at short, intermediate, and long training times. Our analysis shows that dropout enhances robustness to label noise by suppressing detrimental correlations between hidden units, and that the optimal dropout probability depends on both the stage of training and the noise level, increasing with the amount of noise in the data. All theoretical predictions are validated against extensive numerical simulations, providing a rigorous, quantitative foundation for principled dropout design.
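
To make the setting concrete, below is a minimal sketch of online SGD with dropout in a two-layer (soft-committee) teacher-student model, the standard framework for this kind of high-dimensional analysis. The architecture choices, scalings, activation, and hyperparameter values are illustrative assumptions, not the paper's exact protocol.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes and hyperparameters (assumed, not taken from the paper).
d, K, M = 500, 2, 2   # input dimension, student and teacher hidden units
eta = 0.5             # learning rate
p = 0.8               # retention probability (each unit kept with probability p)
noise_std = 0.1       # std of additive label noise
steps = 20_000

g = np.tanh           # activation; g'(a) = 1 - tanh(a)^2

W_star = rng.standard_normal((M, d))            # teacher first-layer weights
W = rng.standard_normal((K, d)) / np.sqrt(d)    # student first-layer weights

for _ in range(steps):
    x = rng.standard_normal(d)                  # fresh i.i.d. sample: online SGD
    y = g(W_star @ x / np.sqrt(d)).mean() + noise_std * rng.standard_normal()
    m = (rng.random(K) < p).astype(float)       # Bernoulli dropout mask, redrawn each step
    h = g(W @ x / np.sqrt(d))
    y_hat = (m * h).sum() / (K * p)             # inverted-dropout forward pass
    # Gradient of the squared loss 0.5 * (y_hat - y)^2; dropped units get no update.
    grad = (y_hat - y) * (m * (1.0 - h**2) / (K * p))[:, None] * x[None, :] / np.sqrt(d)
    W -= eta * grad

# Test-time evaluation uses the full network (no dropout mask).
X = rng.standard_normal((2_000, d))
y_true = g(X @ W_star.T / np.sqrt(d)).mean(axis=1)
y_pred = g(X @ W.T / np.sqrt(d)).mean(axis=1)
print("test MSE:", 0.5 * np.mean((y_pred - y_true) ** 2))
```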

📝 Abstract
Dropout is a regularization technique widely used in training artificial neural networks to mitigate overfitting. It consists of dynamically deactivating subsets of the network during training to promote more robust representations. Despite its widespread adoption, dropout probabilities are often selected heuristically, and theoretical explanations of its success remain sparse. Here, we analytically study dropout in two-layer neural networks trained with online stochastic gradient descent. In the high-dimensional limit, we derive a set of ordinary differential equations that fully characterize the evolution of the network during training and capture the effects of dropout. We obtain a number of exact results describing the generalization error and the optimal dropout probability at short, intermediate, and long training times. Our analysis shows that dropout reduces detrimental correlations between hidden nodes, mitigates the impact of label noise, and that the optimal dropout probability increases with the level of noise in the data. Our results are validated by extensive numerical simulations.
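
Schematically, the ODE description mentioned in the abstract arises because, in the high-dimensional limit, the training dynamics closes on a small number of overlap order parameters. The equations below show the generic teacher-student form of this reduction; the symbols, scalings, and the placeholder functions F and G are illustrative, and the paper's exact dropout-dependent equations may differ.

```latex
% Illustrative online-SGD update with an inverted-dropout mask m_k ~ Bernoulli(p):
\[
\hat{y} \;=\; \frac{1}{Kp}\sum_{k=1}^{K} m_k\, g\!\left(\frac{\mathbf{w}_k\cdot\mathbf{x}}{\sqrt{d}}\right),
\qquad
\mathbf{w}_k \;\leftarrow\; \mathbf{w}_k \;-\; \frac{\eta}{\sqrt{d}}\,\frac{m_k}{Kp}\,(\hat{y}-y)\,
g'\!\left(\frac{\mathbf{w}_k\cdot\mathbf{x}}{\sqrt{d}}\right)\mathbf{x}.
\]
% As d -> infinity with rescaled time tau = t/d, the overlaps
% Q_{kl} = w_k . w_l / d (student-student) and R_{kn} = w_k . w*_n / d (student-teacher)
% concentrate, and the dynamics closes into ODEs of the schematic form
\[
\frac{dQ_{kl}}{d\tau} \;=\; F_{kl}\!\left(Q, R;\, \eta, p, \sigma\right),
\qquad
\frac{dR_{kn}}{d\tau} \;=\; G_{kn}\!\left(Q, R;\, \eta, p, \sigma\right),
\]
% where sigma is the label-noise level and F, G are model-dependent functions.
```
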
Problem

Research questions and friction points this paper is trying to address.

Analyzes dropout regularization in two-layer neural networks from a theoretical standpoint
Determines the optimal dropout probability at short, intermediate, and long training times
Explains how dropout reduces correlations between hidden nodes and mitigates label noise (a diagnostic sketch follows this list)
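
On the correlation point: in this setting, correlations between hidden nodes can be read off the student-student overlap matrix of the first-layer weights. A small diagnostic is sketched below; the stand-in weights and normalization are assumptions, and in practice one would pass in the trained `W` from the sketch above.

```python
import numpy as np

def hidden_overlaps(W: np.ndarray, d: int) -> np.ndarray:
    """Order-parameter matrix Q[k, l] = w_k . w_l / d.

    Off-diagonal entries quantify correlations between hidden units;
    the paper's claim is that dropout drives these toward zero.
    """
    return W @ W.T / d

# Stand-in weights for illustration; replace with the trained student W (K x d).
rng = np.random.default_rng(1)
W = rng.standard_normal((2, 500)) / np.sqrt(500)
Q = hidden_overlaps(W, d=500)
print("off-diagonal overlap Q[0, 1]:", Q[0, 1])
```
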
Innovation

Methods, ideas, or system contributions that make the work stand out.

Analytic study of dropout in two-layer networks trained with online SGD
Closed set of ODEs derived for the training dynamics under dropout
Optimal dropout probability shown to increase with the data noise level (an illustrative sweep follows this list)
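
One way to see the noise dependence empirically: wrap the training sketch above in a helper and sweep the retention probability at several noise levels. `train_and_eval` below is a hypothetical wrapper written for this page, not the paper's code, and the grid-search optimum is only an empirical stand-in for the paper's analytical result; the expectation is that the best retention probability drops (i.e., dropout strengthens) as noise grows.

```python
import numpy as np

def train_and_eval(p: float, noise_std: float, d: int = 300, K: int = 2, M: int = 2,
                   eta: float = 0.5, steps: int = 10_000, seed: int = 0) -> float:
    """Train the two-layer sketch with retention probability p and label-noise
    level noise_std; return the test MSE. All values are illustrative."""
    rng = np.random.default_rng(seed)
    g = np.tanh
    W_star = rng.standard_normal((M, d))
    W = rng.standard_normal((K, d)) / np.sqrt(d)
    for _ in range(steps):
        x = rng.standard_normal(d)
        y = g(W_star @ x / np.sqrt(d)).mean() + noise_std * rng.standard_normal()
        m = (rng.random(K) < p).astype(float)
        h = g(W @ x / np.sqrt(d))
        y_hat = (m * h).sum() / (K * p)
        W -= eta * (y_hat - y) * (m * (1 - h**2) / (K * p))[:, None] * x[None, :] / np.sqrt(d)
    X = rng.standard_normal((2_000, d))
    y_true = g(X @ W_star.T / np.sqrt(d)).mean(axis=1)
    y_pred = g(X @ W.T / np.sqrt(d)).mean(axis=1)
    return float(0.5 * np.mean((y_pred - y_true) ** 2))

# Sweep retention probability at several noise levels.
for noise_std in (0.0, 0.2, 0.5):
    results = [(p, train_and_eval(p, noise_std)) for p in np.linspace(0.5, 1.0, 6)]
    best_p, best_mse = min(results, key=lambda r: r[1])
    print(f"noise_std={noise_std}: best p={best_p:.2f}, test MSE={best_mse:.4f}")
```
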
Francesco Mori
Rudolf Peierls Centre for Theoretical Physics, University of Oxford, Oxford OX1 3PU, United Kingdom
Francesca Mignacco
Princeton University & City University of New York
Statistical Physics · Machine Learning · Theoretical Neuroscience