🤖 AI Summary
This work addresses a gap in how wide feature-learning neural networks trained by gradient flow are regularized: classical ridge regularization biases training and can distort the implicit prior of pretrained models. The authors propose a regime-agnostic energy framework in function space that extends canonical regularization, previously characterized only in the fixed-feature (kernel) setting, to feature-learning scenarios. From the Riemannian geometry of these networks they derive geodesic ridge regularization and introduce arc ridge as a scalable approximation to it. The framework establishes a Riemannian Gibbs Process as the canonical function-space prior and reveals an intrinsic connection between early stopping and canonical regularization. Experiments on image processing and NLP transfer tasks show the proposed method outperforming classical ridge regularization and mitigating its detrimental impact on pretrained models.
📝 Abstract
Wide neural networks in the feature-learning regime drive modern deep learning, and yet they remain far less studied than their kernel-regime counterparts. We consider a critical yet under-explored difference between these two regimes: the regulariser and prior implied by gradient flow training. This canonical regularisation property is well-studied in kernel-regime networks -- of the infinitely many global minima, gradient flow selects exactly the vanishing ridge solution -- and underpins the celebrated NN-GP correspondence, precisely allowing the modelling of noise during training. However, we prove that ridge regularisation biases gradient flow in feature-learning regime networks, even in the infinitesimal limit of vanishing regularisation. Over training, ridge distorts the inductive bias of the network, with particular damage done to pretrained networks, where the implicit prior is informative. We resolve this by axiomatising the canonical regulariser as a regime-agnostic function-space energy and lift, which uniquely identifies ridge in the kernel regime and, crucially, generalises to the feature-learning regime. By studying the Riemannian geometry of feature-learning networks, we derive geodesic ridge from our framework, generalising ridge to the feature-learning regime. Correspondingly, we prove that the canonical function-space prior is a Riemannian Gibbs Process, generalising the more familiar Gaussian Process. As a practical contribution, we propose arc ridge as a minimax-robust, scalable surrogate to geodesic ridge, revealing a deep relationship between early stopping and canonical regularisation across learning regimes. Finally, we demonstrate the consequences of our theory empirically on both image processing and NLP transfer-learning problems.
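The kernel-regime statement above (gradient flow selects the vanishing ridge solution) can be checked numerically in the simplest fixed-feature setting. The sketch below is an illustration only, not the authors' code. It assumes a linear model on a random feature matrix `Phi` as a hypothetical stand-in for a kernel-regime network, zero initialisation, and small-step gradient descent as a discretisation of gradient flow. All names and sizes are chosen for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 20, 100                      # fewer samples than features: infinitely many interpolants
Phi = rng.standard_normal((n, p))   # fixed features (assumed stand-in for a kernel-regime network)
y = rng.standard_normal(n)

# Small-step gradient descent on 0.5 * ||Phi w - y||^2 from zero initialisation,
# used here as a discretisation of gradient flow.
w_gd = np.zeros(p)
step = 1.0 / np.linalg.norm(Phi, 2) ** 2   # 1 / (largest singular value)^2
for _ in range(2000):
    w_gd -= step * Phi.T @ (Phi @ w_gd - y)

def ridge(lam):
    """Ridge solution in dual form: Phi^T (Phi Phi^T + lam I)^{-1} y."""
    return Phi.T @ np.linalg.solve(Phi @ Phi.T + lam * np.eye(n), y)

w_ridge = ridge(1e-8)                  # ridge with a very small penalty
w_min_norm = np.linalg.pinv(Phi) @ y   # minimum-norm interpolant

gap_gd_ridge = np.linalg.norm(w_gd - w_ridge)
gap_gd_min_norm = np.linalg.norm(w_gd - w_min_norm)
residual = np.linalg.norm(Phi @ w_gd - y)
print(gap_gd_ridge, gap_gd_min_norm, residual)
```

In this fixed-feature case all three solutions agree up to numerical error. The abstract's claim is that this agreement breaks down once the features themselves are learned, which this linear sketch does not attempt to reproduce.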