The Feature Speed Formula: a flexible approach to scale hyper-parameters of deep neural networks

📅 2023-11-30
🏛️ Neural Information Processing Systems
📈 Citations: 4
Influential: 0
🤖 AI Summary
This work addresses the problem of interpretable control over feature learning in deep neural networks. Methodologically, it introduces the "feature speed formula," establishing an explicit analytical relationship between the magnitude of layer-wise feature updates, the backward-pass angle θₗ, the loss decay, and the backward-pass magnitude. It characterizes the mechanism by which θₗ degenerates with depth and shows that branch scaling in ResNets effectively mitigates this angular degeneration. Leveraging NTK spectral analysis, conditioning of the layer-to-layer Jacobians, and asymptotic derivations in the large width-then-depth limit, the framework recovers key properties of classical hyperparameter scaling laws and proposes a novel scaling scheme for deep ReLU MLPs that ensures non-degenerate feature learning and stable convergence. The resulting theory enables direct, interpretable control over initialization scale, learning rate, and other hyperparameters, bridging theoretical understanding and practical design in deep learning.
📝 Abstract
Deep learning succeeds by doing hierarchical feature learning, yet tuning hyper-parameters (HP) such as initialization scales, learning rates, etc., only gives indirect control over this behavior. In this paper, we introduce a key notion to predict and control feature learning: the angle $\theta_\ell$ between the feature updates and the backward pass (at layer index $\ell$). We show that the magnitude of feature updates after one GD step, at any training time, can be expressed via a simple and general \emph{feature speed formula} in terms of this angle $\theta_\ell$, the loss decay, and the magnitude of the backward pass. This angle $\theta_\ell$ is controlled by the conditioning of the layer-to-layer Jacobians and at random initialization, it is determined by the spectrum of a certain kernel, which coincides with the Neural Tangent Kernel when $\ell=\text{depth}$. Given $\theta_\ell$, the feature speed formula provides us with rules to adjust HPs (scales and learning rates) so as to satisfy certain dynamical properties, such as feature learning and loss decay. We investigate the implications of our approach for ReLU MLPs and ResNets in the large width-then-depth limit. Relying on prior work, we show that in ReLU MLPs with iid initialization, the angle degenerates with depth as $\cos(\theta_\ell)=\Theta(1/\sqrt{\ell})$. In contrast, ResNets with branch scale $O(1/\sqrt{\text{depth}})$ maintain a non-degenerate angle $\cos(\theta_\ell)=\Theta(1)$.
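The exact statement of the feature speed formula is not reproduced on this page. A plausible first-order sketch, using only the quantities named in the abstract ($b_\ell$ for the backward pass at layer $\ell$, $\Delta z_\ell$ for the feature update, and assuming $\theta_\ell$ is measured against the descent direction $-b_\ell$), reads:

```latex
% First-order expansion of the loss change after one GD step, keeping
% the contribution of the feature update \Delta z_\ell at layer \ell:
\Delta \mathcal{L} \;\approx\; \langle b_\ell, \Delta z_\ell \rangle
  \;=\; -\cos(\theta_\ell)\,\|b_\ell\|\,\|\Delta z_\ell\|,
% which, rearranged, expresses the "feature speed" as
\|\Delta z_\ell\| \;\approx\; \frac{|\Delta \mathcal{L}|}{\cos(\theta_\ell)\,\|b_\ell\|}.
```

This rearrangement makes the abstract's claim concrete: for a fixed loss decay and backward-pass magnitude, a degenerate angle ($\cos(\theta_\ell)\to 0$) forces large feature updates, which is why controlling $\theta_\ell$ gives control over feature learning.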
Problem

Research questions and friction points this paper is trying to address.

Predict and control feature learning in deep neural networks
Adjust hyper-parameters to ensure feature learning and loss decay
Analyze angle dynamics in ReLU MLPs and ResNets for scaling
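The central quantity above, the angle $\theta_\ell$ between a layer's feature update and its backward pass, can be measured directly. Below is an illustrative numpy sketch (not from the paper): a two-layer ReLU network where we take one GD step and compute the cosine between the first layer's pre-activation update and the descent direction $-b_1$. For a single sample at the first layer the update is exactly $-\eta\,\|x\|^2\,b_1$, so the cosine is 1; the depth degeneration studied in the paper only appears at deeper layers.

```python
import numpy as np

rng = np.random.default_rng(0)
d, h = 10, 50
x = rng.normal(size=d)
W1 = rng.normal(size=(h, d)) / np.sqrt(d)
w2 = rng.normal(size=h) / np.sqrt(h)
y, lr = 1.0, 1e-3

def forward(W1, w2):
    z1 = W1 @ x                # pre-activations at layer 1
    a1 = np.maximum(z1, 0.0)   # ReLU
    out = w2 @ a1              # scalar output
    return z1, a1, out

z1, a1, out = forward(W1, w2)
loss = 0.5 * (out - y) ** 2

# backward pass b_l = dL/dz_l (manual backprop for squared loss)
b2 = out - y
b1 = b2 * w2 * (z1 > 0)

# one gradient-descent step on both weight matrices
z1n, _, outn = forward(W1 - lr * np.outer(b1, x), w2 - lr * b2 * a1)

# cosine between the feature update dz1 and the descent direction -b1
dz1 = z1n - z1
cos_theta = dz1 @ (-b1) / (np.linalg.norm(dz1) * np.linalg.norm(b1))
print(f"cos(theta_1) = {cos_theta:.4f}, loss {loss:.4f} -> {0.5*(outn - y)**2:.4f}")
```

With more layers, updates to earlier weights also move $z_\ell$ through the intermediate Jacobians, and it is the conditioning of those Jacobians that drives $\cos(\theta_\ell)$ away from 1.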
Innovation

Methods, ideas, or system contributions that make the work stand out.

Feature speed formula predicts updates via angles
Hyper-parameter adjustment via conditioning of layer-to-layer Jacobians
ResNets maintain non-degenerate angles with branch scaling
Lénaïc Chizat
École Polytechnique Fédérale de Lausanne (EPFL), Institute of Mathematics, 1015 Lausanne, Switzerland
Praneeth Netrapalli
Google DeepMind
Machine learning