🤖 AI Summary
This work addresses the theoretical foundations of neural scaling laws for hierarchically compositional data (e.g., language, images), seeking a unified explanation for their emergence from the interplay between power-law-distributed features and hierarchical structure.
Method: We model hierarchical generative processes using probabilistic context-free grammars (PCFGs), combining theoretical learning-curve analysis with asymptotic complexity derivations.
Contribution/Results: We provide the first unified account linking power-law feature distributions and hierarchical compositionality. For classification, learning curves follow a power-law decay whose exponent is determined by the power-law exponent of the production-rule probabilities, while the multiplicative constant scales exponentially with hierarchy depth. For next-token prediction, the large-scale scaling exponent decouples from the rule distribution and depends only on the global grammar structure. These results reveal task-dependent mechanistic origins of scaling behaviour, establishing a coherent theoretical framework for understanding neural generalization and scaling across modalities.
📝 Abstract
Recent theories suggest that Neural Scaling Laws arise whenever the task can be linearly decomposed into power-law-distributed units. Scaling laws also emerge when data exhibit a hierarchically compositional structure, as is thought to occur in language and images. To unify these views, we consider classification and next-token prediction tasks based on probabilistic context-free grammars -- probabilistic models that generate data via a hierarchy of production rules. For classification, we show that power-law-distributed production rules result in a power-law learning curve with an exponent depending on the rules' distribution and a large multiplicative constant that depends on the hierarchical structure. By contrast, for next-token prediction, the distribution of production rules controls the local details of the learning curve, but not the exponent describing the large-scale behaviour.
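To make the generative setup concrete, here is a minimal sketch of sampling from a depth-L hierarchy in which every symbol expands into two children and production rules are chosen with power-law (Zipf-like) probabilities over rule ranks. All names and parameters (`vocab_size`, `n_rules`, `alpha`, the child mapping) are illustrative assumptions, not the paper's actual grammar:

```python
import random

def zipf_weights(n_rules, alpha):
    """Normalised power-law weights p_k ∝ k^(-alpha) over rule ranks 1..n_rules."""
    weights = [k ** -alpha for k in range(1, n_rules + 1)]
    total = sum(weights)
    return [w / total for w in weights]

def sample_leaves(depth, vocab_size, n_rules, alpha, rng):
    """Generate a leaf string by recursively applying a power-law-ranked
    production rule at each level; each rule maps one symbol to two children."""
    probs = zipf_weights(n_rules, alpha)

    def expand(symbol, level):
        if level == depth:
            return [symbol]
        # pick a production rule with power-law probability over its rank
        rule = rng.choices(range(n_rules), weights=probs)[0]
        # hypothetical deterministic children given (symbol, rule)
        left = (symbol * n_rules + rule) % vocab_size
        right = (symbol + rule + 1) % vocab_size
        return expand(left, level + 1) + expand(right, level + 1)

    return expand(0, 0)

rng = random.Random(0)
leaves = sample_leaves(depth=3, vocab_size=16, n_rules=8, alpha=1.5, rng=rng)
print(len(leaves))  # a depth-3 binary hierarchy yields 2**3 = 8 leaves
```

The key point of the setup is visible here: the number of leaves (input size) grows exponentially with depth, while the choice of rule at each node follows the power-law distribution whose exponent, per the abstract, governs the classification learning-curve exponent.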