🤖 AI Summary
This work investigates the convergence of training dynamics in residual neural networks (ResNets) as the depth \(L\), hidden width \(M\), and embedding dimension \(D\) jointly tend to infinity. Focusing on ResNets with two-layer perceptron residual blocks in the maximal local feature update (MLU) regime, the analysis combines the cavity method with propagation of chaos at a functional level, characterizing weight updates through skeleton maps as functions of CLT-type sums over the past. This yields one of the first rigorous quantitative convergence bounds for a DMFT-type limit, with error \(O(1/L + \sqrt{D/(LM)} + 1/\sqrt{D})\). Under a parameter budget \(P = \Theta(LMD)\), the scalings of \((L, M, D)\) that minimize the bound give a rate \(O(P^{-1/6})\). The bound is empirically tight in embedding space, and the analysis applies formally to mainstream architectures, including Transformers with bounded key-query dimension.
📝 Abstract
We establish convergence of the training dynamics of residual neural networks (ResNets) to their joint limit as the depth \(L\), hidden width \(M\), and embedding dimension \(D\) tend to infinity. Specifically, we consider ResNets with two-layer perceptron blocks in the maximal local feature update (MLU) regime and prove that, after a bounded number of training steps, the error between the ResNet and its large-scale limit is \(O(1/L + \sqrt{D/(LM)} + 1/\sqrt{D})\). This error rate is empirically tight when measured in embedding space. For a budget of \(P = \Theta(LMD)\) parameters, this yields a convergence rate \(O(P^{-1/6})\) for the scalings of \((L, M, D)\) that minimize the bound. Our analysis exploits the depth-two structure of residual blocks in an essential way and applies formally to a broad class of state-of-the-art architectures, including Transformers with bounded key-query dimension. From a technical viewpoint, this work completes the program initiated in the companion paper [Chi25], where it is proved that, for a fixed embedding dimension \(D\), the training dynamics converge to a Mean ODE dynamics at rate \(O(1/L + \sqrt{D/(LM)})\). Here, we study the large-\(D\) limit of this Mean ODE model and establish convergence at rate \(O(1/\sqrt{D})\); the above bound then follows by a triangle inequality. To handle the rich probabilistic structure of the limit dynamics and obtain one of the first rigorous quantitative convergence results for a DMFT-type limit, we combine the cavity method with propagation-of-chaos arguments at a functional level on so-called skeleton maps, which express the weight updates as functions of CLT-type sums over the past.
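The \(O(P^{-1/6})\) rate under the budget \(P = \Theta(LMD)\) follows from a standard balancing argument; a sketch (the specific scalings \(D \asymp L^2\), \(M \asymp L^3\) are what this balancing yields, not quoted from the paper):

```latex
% Balance the three error terms 1/L, \sqrt{D/(LM)}, 1/\sqrt{D}
% subject to the parameter budget P = LMD.
\frac{1}{L} \asymp \frac{1}{\sqrt{D}}
  \;\Longrightarrow\; D \asymp L^{2},
\qquad
\sqrt{\frac{D}{LM}} \asymp \frac{1}{L}
  \;\Longrightarrow\; M \asymp DL \asymp L^{3},
\qquad
P = LMD \asymp L^{6}
  \;\Longrightarrow\; \frac{1}{L} \asymp P^{-1/6}.
```

With all three terms of the same order, the overall bound is \(O(1/L) = O(P^{-1/6})\), matching the rate stated above.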