Mitigating the Curse of Detail: Scaling Arguments for Feature Learning and Sample Complexity

📅 2025-12-03
📈 Citations: 0
Influential: 0
🤖 AI Summary
Problem: Deep learning theory still lacks a general analytical framework for characterizing feature-learning mechanisms and implicit bias, especially for deep nonlinear networks; existing approaches rely on solving high-dimensional nonlinear equations, which is computationally prohibitive and hard to interpret. Method: We propose a first-principles scaling-analysis paradigm and apply it, for the first time, to three-layer nonlinear networks and attention heads. By jointly modeling the data and width scales, we analytically derive the critical conditions for the emergence of feature learning, circumventing numerical bottlenecks. Contribution/Results: The method recovers established scaling exponents while yielding novel, experimentally testable predictions; it substantially reduces theoretical complexity and extends the explanatory power of first-principles deep learning theory to more realistic architectures.

📝 Abstract
Two pressing topics in the theory of deep learning are the interpretation of feature learning mechanisms and the determination of implicit bias of networks in the rich regime. Current theories of rich feature learning effects revolve around networks with one or two trainable layers or deep linear networks. Furthermore, even under such limiting settings, predictions often appear in the form of high-dimensional non-linear equations, which require computationally intensive numerical solutions. Given the many details that go into defining a deep learning problem, this analytical complexity is a significant and often unavoidable challenge. Here, we propose a powerful heuristic route for predicting the data and width scales at which various patterns of feature learning emerge. This form of scale analysis is considerably simpler than such exact theories and reproduces the scaling exponents of various known results. In addition, we make novel predictions on complex toy architectures, such as three-layer non-linear networks and attention heads, thus extending the scope of first-principle theories of deep learning.
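The lazy-versus-rich distinction the abstract refers to can be illustrated with a toy experiment (not from the paper; a minimal sketch under standard assumptions about parameterization): after one gradient step on a two-layer ReLU network, the hidden features barely move in the NTK ("lazy") parameterization as width grows, while a mean-field-style parameterization keeps feature movement at order one. The function name `feature_movement` and all constants are illustrative choices, not the paper's method.

```python
import numpy as np

def feature_movement(width, d=32, lr=1.0, param="ntk", seed=0):
    """Relative change of the hidden pre-activations after one gradient
    step on a single (x, y) pair, for a two-layer ReLU network.

    param="ntk": output scaled by 1/sqrt(width), learning rate O(1) (lazy).
    param="mf" : output scaled by 1/width, learning rate O(width) (rich).
    """
    rng = np.random.default_rng(seed)
    x = rng.normal(size=d) / np.sqrt(d)          # unit-scale input
    y = 1.0                                      # scalar target
    W = rng.normal(size=(width, d))              # first-layer weights
    a = rng.normal(size=width)                   # readout weights
    scale = np.sqrt(width) if param == "ntk" else width
    step = lr if param == "ntk" else lr * width  # mean-field uses larger lr

    h = W @ x                                    # hidden pre-activations
    f = a @ np.maximum(h, 0) / scale             # network output
    # gradient of 0.5 * (f - y)^2 with respect to W
    gW = np.outer((f - y) * a * (h > 0) / scale, x)
    W_new = W - step * gW
    return np.linalg.norm(W_new @ x - h) / np.linalg.norm(h)
```

Averaged over a few seeds, the "ntk" movement shrinks roughly like 1/sqrt(width) as width increases, while the "mf" movement stays roughly constant, which is the kind of width-scale distinction a scaling analysis aims to predict without solving the full training dynamics.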
Problem

Research questions and friction points this paper is trying to address.

Rich feature-learning theories currently cover only networks with one or two trainable layers, or deep linear networks
Even in those limited settings, predictions take the form of high-dimensional nonlinear equations requiring costly numerical solution
The many details that define a deep learning problem make this analytical complexity hard to avoid
Innovation

Methods, ideas, or system contributions that make the work stand out.

Scale-analysis heuristic predicts the data and width scales at which feature learning emerges
Considerably simpler than exact theories while reproducing known scaling exponents
Yields new predictions for complex toy architectures such as three-layer nonlinear networks and attention heads