Deep Neural Network Training as Random Effects: An Optimization-Inference Duality

📅 2026-05-27
🤖 AI Summary
This work addresses a long-standing gap: the training duration of over-parameterized deep neural networks is usually chosen by empirical heuristics that lack a rigorous statistical foundation. It establishes an exact equivalence between the prediction from continuous-time neural tangent kernel (NTK) gradient flow and the prediction from a classical random-effects (linear mixed-effects) model, revealing a duality between optimization trajectories and empirical Bayes inference. By interpreting training time as a variance component, the study recasts early stopping as a statistical inference problem and proposes a stopping criterion based on restricted maximum likelihood (REML). The resulting rule achieves asymptotically optimal in-sample prediction error under fixed design and, under additional random-design regularity conditions, asymptotically optimal out-of-sample prediction error. It also admits an interpretable spectral-domain form grounded in likelihood principles.
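The equivalence described above can be checked numerically in a toy setting. The sketch below is an illustration under stated assumptions, not the paper's construction: it assumes squared loss, zero output at initialization, and a fixed kernel matrix `K`, so that gradient flow gives the in-sample fit f_t = (I - exp(-tK)) y. It then assumes one particular random-effects covariance, S_t = sigma2 (exp(tK) - I), chosen here because it reproduces the same shrinkage; the function names and the stand-in Gram matrix are hypothetical.

```python
import numpy as np

def gradient_flow_fit(K, y, t):
    """In-sample fit of kernel gradient flow on squared loss at time t,
    started from zero output: f_t = (I - exp(-t K)) y."""
    lam, U = np.linalg.eigh(K)
    shrink = 1.0 - np.exp(-t * lam)          # per-eigendirection shrinkage
    return U @ (shrink * (U.T @ y))

def random_effects_posterior_mean(K, y, t, sigma2=1.0):
    """Posterior mean E[u | y] for y = u + e with u ~ N(0, S_t) and
    e ~ N(0, sigma2 I), where S_t = sigma2 (exp(t K) - I) is an assumed
    covariance that grows with training time t."""
    lam, U = np.linalg.eigh(K)
    S = (U * (sigma2 * np.expm1(t * lam))) @ U.T
    return S @ np.linalg.solve(S + sigma2 * np.eye(len(y)), y)

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
K = X @ X.T / 20                             # stand-in PSD Gram matrix, not a real NTK
y = rng.normal(size=20)

f_gd = gradient_flow_fit(K, y, t=2.0)
f_re = random_effects_posterior_mean(K, y, t=2.0)
print(np.allclose(f_gd, f_re))               # the two fits should coincide
```

In each eigendirection with eigenvalue λ, the posterior mean shrinks the response by s/(s + sigma2) with s = sigma2 (exp(tλ) - 1), which equals 1 - exp(-tλ), the gradient-flow factor. Under this assumed parameterization, a longer training time corresponds to a larger signal variance relative to noise.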
📝 Abstract
Deep neural networks (DNNs) have achieved remarkable empirical success, yet their training dynamics remain understood mainly from optimization rather than statistical principles. Here we develop a statistical framework for DNN training in the over-parameterized regime by showing that the prediction induced by continuous-time neural tangent kernel (NTK) gradient flow is exactly equivalent to that from a classical random-effects model. In this framework, training time acts as a variance component, or equivalently an empirical Bayes covariance hyperparameter, governing the allocation of variation from noise to structured signal. This equivalence reveals an optimization-inference duality: the gradient-flow path is both an optimization trajectory and an empirical Bayes random-effects inference path. Conditional on training time, the network output is the posterior mean of the latent signal, and estimating training time by restricted maximum likelihood (REML) turns early stopping into likelihood-based empirical Bayes inference rather than external tuning. This perspective yields a two-stage inferential procedure. First, a variance-component test determines whether DNN training captures statistically significant structure beyond initialization. Second, conditional on training being warranted, REML provides a likelihood-based early stopping rule. The resulting stopping time admits a spectral interpretation in the NTK eigenbasis, where training proceeds until spectral loss decorrelation is achieved. We further establish that REML-guided early stopping achieves asymptotically optimal prediction error for fixed-design in-sample prediction and, under additional random-design regularity conditions, for out-of-sample prediction. This work reframes DNN training as statistical inference and provides a principled foundation for deciding whether and how long to train deep neural networks.
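To make the idea of a likelihood-based stopping time concrete, here is a minimal sketch under simplifying assumptions that are not taken from the paper: zero initialization, squared loss, a known noise variance `sigma2`, and no fixed effects (so REML coincides with ordinary marginal maximum likelihood). With the assumed signal covariance sigma2 (exp(tK) - I), the marginal law is y ~ N(0, sigma2 exp(tK)), and the stopping time is the t that maximizes this likelihood. The paper's actual REML criterion and its variance-component test may differ; all function names are hypothetical.

```python
import numpy as np

def score(t, lam, z2, sigma2):
    """Derivative in t of -2 log-likelihood for y ~ N(0, sigma2 * exp(t K)),
    written in the eigenbasis of K (lam: eigenvalues, z2: squared coordinates of y)."""
    return np.sum(lam * (1.0 - z2 * np.exp(-t * lam) / sigma2))

def likelihood_stopping_time(K, y, sigma2=1.0, tol=1e-10):
    """Maximize the marginal likelihood over t >= 0 by bisection on the score,
    which is nondecreasing in t."""
    lam, U = np.linalg.eigh(K)
    lam = np.clip(lam, 0.0, None)            # remove tiny negative round-off
    z2 = (U.T @ y) ** 2
    if score(0.0, lam, z2, sigma2) >= 0.0:
        return 0.0                           # likelihood prefers no training
    lo, hi = 0.0, 1.0
    while score(hi, lam, z2, sigma2) < 0.0:  # bracket the root
        hi *= 2.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if score(mid, lam, z2, sigma2) < 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

# Simulate responses from the assumed model with true training time 3.0.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 50))
K = X @ X.T / 200                            # stand-in PSD Gram matrix, not a real NTK
lam0, U0 = np.linalg.eigh(K)
lam0 = np.clip(lam0, 0.0, None)
y = U0 @ (np.exp(0.5 * 3.0 * lam0) * rng.normal(size=200))

t_hat = likelihood_stopping_time(K, y)
print(t_hat)
```

The boundary case `t_hat == 0.0` loosely mirrors the first stage of the procedure, where training may not be warranted at all, although a formal variance-component test would need a null distribution that this sketch does not compute. In the eigenbasis the residual coordinate at time t is exp(-tλ) z, so the score compares eigenvalue-weighted products of responses and residuals against the noise level.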
Problem

Research questions and friction points this paper is trying to address.

deep neural networks
random effects
early stopping
empirical Bayes
optimization-inference duality
Innovation

Methods, ideas, or system contributions that make the work stand out.

optimization-inference duality
random-effects model
neural tangent kernel
empirical Bayes
restricted maximum likelihood