How Feature Learning Can Improve Neural Scaling Laws

📅 2024-09-26

🏛️ arXiv.org

📈 Citations: 10

✨ Influential: 0

🤖 AI Summary

This work investigates how feature learning affects neural scaling laws, examining how task difficulty (hard, easy, or super-easy) changes the way performance scales with model size, training time, and dataset size. The authors develop a solvable model beyond the kernel limit, classify tasks by whether the target function lies in the RKHS of the initial infinite-width NTK, and test the predictions with nonlinear MLPs and CNNs. The theory shows that feature learning nearly doubles the training-time and compute scaling exponents for hard tasks, which lie outside that RKHS, and that this changes the compute-optimal strategy for scaling parameters and training time. For easy and super-easy tasks, which lie inside the RKHS, the exponents are the same in the feature learning and kernel regimes. Experiments with MLPs fitting functions with power-law Fourier spectra on the circle and with CNNs learning vision tasks support both findings.

📝 Abstract

We develop a solvable model of neural scaling laws beyond the kernel limit. Theoretical analysis of this model shows how performance scales with model size, training time, and the total amount of available data. We identify three scaling regimes corresponding to varying task difficulties: hard, easy, and super easy tasks. For easy and super-easy target functions, which lie in the reproducing kernel Hilbert space (RKHS) defined by the initial infinite-width Neural Tangent Kernel (NTK), the scaling exponents remain unchanged between feature learning and kernel regime models. For hard tasks, defined as those outside the RKHS of the initial NTK, we demonstrate both analytically and empirically that feature learning can improve scaling with training time and compute, nearly doubling the exponent for hard tasks. This leads to a different compute optimal strategy to scale parameters and training time in the feature learning regime. We support our finding that feature learning improves the scaling law for hard tasks but not for easy and super-easy tasks with experiments of nonlinear MLPs fitting functions with power-law Fourier spectra on the circle and CNNs learning vision tasks.
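The abstract's split between hard tasks and the rest turns on whether the target lies in the RKHS of the initial NTK. A minimal sketch of that membership test follows, under assumptions that are not taken from the abstract: kernel eigenvalues decay as `k**-eig_decay`, squared target coefficients decay as `k**-coef_decay`, and the function names are hypothetical. It relies only on the standard fact that the squared RKHS norm is the sum of squared coefficients divided by the eigenvalues.

```python
import numpy as np

def rkhs_norm_sq_partial(coef_decay, eig_decay, n_modes):
    """Partial sum of sum_k c_k^2 / lambda_k over the first n_modes modes.

    Assumed (illustrative) power laws: c_k^2 = k**-coef_decay and
    lambda_k = k**-eig_decay, so each term equals k**(eig_decay - coef_decay).
    """
    k = np.arange(1, n_modes + 1, dtype=float)
    return float(np.sum(k ** (eig_decay - coef_decay)))

def in_rkhs(coef_decay, eig_decay):
    """True when the RKHS norm is finite under the assumed power laws.

    The series sum_k k**-(coef_decay - eig_decay) is a p-series, which
    converges exactly when the exponent is greater than 1.
    """
    return coef_decay - eig_decay > 1.0

# Target inside the RKHS (not "hard"): the partial sums level off.
easy_small = rkhs_norm_sq_partial(4.0, 2.0, 10**3)
easy_large = rkhs_norm_sq_partial(4.0, 2.0, 10**6)

# Target outside the RKHS ("hard"): the partial sums keep growing.
hard_small = rkhs_norm_sq_partial(2.5, 2.0, 10**3)
hard_large = rkhs_norm_sq_partial(2.5, 2.0, 10**6)
```

In this toy setting a slowly decaying target spectrum relative to the kernel spectrum is what places a task in the hard regime; the exponent values above are arbitrary illustrations.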

Problem

Research questions and friction points this paper is trying to address.

Understand how performance scales with model size, training time, and dataset size.

Compare scaling exponents in the feature learning and kernel regimes across hard, easy, and super-easy tasks.

Show analytically and empirically that feature learning improves scaling for hard tasks.
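Comparing regimes in practice means estimating a scaling exponent from a measured loss curve. One common way to do that, sketched here as an assumption rather than as the paper's procedure, is a straight-line fit in log-log space. The function name and the synthetic data are hypothetical.

```python
import numpy as np

def fit_power_law(x, loss):
    """Least-squares fit of loss ~ prefactor * x**(-exponent) in log-log space."""
    # np.polyfit returns coefficients from highest degree down: [slope, intercept].
    slope, intercept = np.polyfit(np.log(x), np.log(loss), deg=1)
    return float(-slope), float(np.exp(intercept))

# Synthetic, noise-free curve with a known exponent, used only as a sanity check.
steps = np.logspace(1, 5, 50)
synthetic_loss = 3.0 * steps ** -0.5

exponent, prefactor = fit_power_law(steps, synthetic_loss)
```

On real training curves the fit should be restricted to the range where the loss actually follows a power law, since early transients and late plateaus bias the slope.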

Innovation

Methods, ideas, or system contributions that make the work stand out.

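Feature learning improves scaling with training time and compute for hard tasks, nearly doubling the exponent

Three scaling regimes tied to task difficulty: hard, easy, and super-easy

Analytical predictions validated empirically with MLPs and CNNs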
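To see why a larger training-time exponent changes the compute-optimal strategy, consider a simplified toy loss `t**-r_t + n**-r_n` with compute `n * t`. This additive form and the symbols `r_t`, `r_n` are assumptions made for illustration, not the paper's full expression. Minimizing over `t` at fixed compute gives the allocation exponents below, and a brute-force search checks the closed form.

```python
import numpy as np

def compute_optimal_exponents(r_t, r_n):
    """Exponents of compute for optimal t, optimal n, and the resulting loss.

    Assumes the toy loss t**-r_t + n**-r_n with compute = n * t.
    Returns (t_exp, n_exp, loss_exp): t ~ C**t_exp, n ~ C**n_exp, loss ~ C**-loss_exp.
    """
    total = r_t + r_n
    return r_n / total, r_t / total, r_t * r_n / total

def optimal_t(r_t, r_n, compute):
    """Closed-form minimizer of t**-r_t + (compute / t)**-r_n over t."""
    total = r_t + r_n
    return (r_t / r_n) ** (1.0 / total) * compute ** (r_n / total)

def brute_force_best_t(r_t, r_n, compute, grid=200001):
    """Grid search over t in [1, compute] on a log scale, as a numerical check."""
    t = np.logspace(0, np.log10(compute), grid)
    loss = t ** -r_t + (compute / t) ** -r_n
    return float(t[np.argmin(loss)])

# Illustrative exponents only: doubling r_t shifts the optimal allocation.
kernel_like = compute_optimal_exponents(0.5, 1.0)
feature_like = compute_optimal_exponents(1.0, 1.0)
```

With these arbitrary values, doubling `r_t` moves the toy optimum from spending two thirds of the compute exponent on training time to an even split with parameters, and raises the loss exponent in compute from 1/3 to 1/2.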
