🤖 AI Summary
This paper identifies a statistical suboptimality of stochastic gradient descent (SGD) for neural network feature learning: once inputs are non-isotropic or ill-conditioned, SGD can no longer extract discriminative features efficiently. To address this, the authors propose layer-wise preconditioning and, for the first time, rigorously establish its statistical necessity for provably effective feature learning, moving beyond prior empirical analyses. Leveraging tools from linear representation learning, single-index models, matrix perturbation theory, and optimization-dynamics modeling, they prove that layer-wise preconditioning restores optimal convergence rates. Numerical experiments show that the method significantly outperforms Adam(W) and BatchNorm, which only mildly mitigate these issues. The core contribution is the first formal guarantee of both the statistical necessity and the efficacy of layer-wise preconditioning for efficient, discriminative representation learning.
📝 Abstract
Layer-wise preconditioning methods are a family of memory-efficient optimization algorithms that introduce preconditioners per axis of each layer's weight tensors. These methods have seen a recent resurgence, demonstrating impressive performance relative to entry-wise ("diagonal") preconditioning methods such as Adam(W) on a wide range of neural network optimization tasks. Complementary to their practical performance, we demonstrate that layer-wise preconditioning methods are provably necessary from a statistical perspective. To showcase this, we consider two prototypical models, linear representation learning and single-index learning, which are widely used to study how typical algorithms efficiently learn useful features to enable generalization. In these problems, we show SGD is a suboptimal feature learner when extending beyond ideal isotropic inputs $\mathbf{x} \sim \mathsf{N}(\mathbf{0}, \mathbf{I})$ and well-conditioned settings typically assumed in prior work. We demonstrate theoretically and numerically that this suboptimality is fundamental, and that layer-wise preconditioning emerges naturally as the solution. We further show that standard tools like Adam preconditioning and batch-norm only mildly mitigate these issues, supporting the unique benefits of layer-wise preconditioning.
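To make "preconditioners per axis of each layer's weight tensors" concrete, here is a minimal sketch of one Shampoo-style layer-wise preconditioned update for a single weight matrix. This is an illustrative instance of the general family, not the paper's specific algorithm; the function name, step size, and damping constant `eps` are assumptions for the example.

```python
import numpy as np

def layerwise_preconditioned_step(W, G, L, R, lr=0.1, eps=1e-8):
    """One layer-wise (Shampoo-style) preconditioned step on weight matrix W.

    Unlike entry-wise ("diagonal") preconditioners such as Adam, this keeps
    one preconditioner per axis of W: L accumulates G @ G.T for the row
    axis, R accumulates G.T @ G for the column axis.
    """
    L += G @ G.T
    R += G.T @ G

    def inv_fourth_root(M):
        # L and R are symmetric PSD, so use an eigendecomposition;
        # eps damps near-zero eigenvalues for numerical stability.
        vals, vecs = np.linalg.eigh(M)
        return vecs @ np.diag((vals + eps) ** -0.25) @ vecs.T

    # Precondition the gradient on both axes, then take a gradient step.
    W = W - lr * inv_fourth_root(L) @ G @ inv_fourth_root(R)
    return W, L, R
```

For an n-dimensional weight tensor the same idea maintains one such statistic per axis, which is what makes these methods memory-efficient relative to storing a full (flattened) second-moment matrix.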