Learning curves theory for hierarchically compositional data with power-law distributed features

📅 2025-05-11
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the theoretical foundations of neural scaling laws for hierarchically compositional data (e.g., language, images), seeking a unified explanation for how they emerge from the interplay between power-law-distributed features and hierarchical structure. Method: we model hierarchical generative processes with probabilistic context-free grammars (PCFGs) and combine theoretical learning-curve analysis with asymptotic complexity derivations. Contribution/Results: we provide the first unified account linking power-law feature distributions with hierarchical compositionality. For classification, learning curves follow a power-law decay whose exponent is set by the power-law exponent of the production-rule probabilities, while the multiplicative constant scales exponentially with the hierarchy depth. For next-token prediction, the large-scale scaling exponent decouples from the rule distribution and depends only on the global grammar structure. These results reveal task-dependent mechanistic origins of scaling behavior, establishing a coherent theoretical framework for understanding neural generalization and scaling across modalities.
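To make the data model concrete, here is a minimal sketch (not the authors' code) of sampling from a depth-L PCFG whose production-rule probabilities follow a power law. All parameters below — vocabulary size, branching factor, depth, number of rules per symbol, and the exponent — are illustrative assumptions, not values from the paper.

```python
# Minimal sketch (not the paper's code): sampling from a depth-DEPTH PCFG whose
# production-rule probabilities follow a Zipf/power law. Symbol counts, branching
# factor, depth, and the exponent ALPHA are illustrative choices, not values
# taken from the paper.
import numpy as np

rng = np.random.default_rng(0)

VOCAB = 8      # symbols per level (assumed)
BRANCH = 2     # each symbol expands into BRANCH lower-level symbols (assumed)
DEPTH = 4      # number of hierarchy levels (assumed)
N_RULES = 5    # production rules available per (level, symbol) pair (assumed)
ALPHA = 1.5    # power-law exponent of the rule probabilities (assumed)

# Zipf-like weights over the N_RULES alternatives: p_k ∝ k^(-ALPHA)
rule_probs = np.arange(1, N_RULES + 1, dtype=float) ** (-ALPHA)
rule_probs /= rule_probs.sum()

# For every level and symbol, fix N_RULES random expansions into BRANCH symbols.
rules = {
    (level, sym): rng.integers(0, VOCAB, size=(N_RULES, BRANCH))
    for level in range(DEPTH)
    for sym in range(VOCAB)
}

def expand(symbol: int, level: int = 0) -> list[int]:
    """Recursively expand a symbol into a string of leaf tokens."""
    if level == DEPTH:
        return [symbol]                      # leaves are terminal tokens
    k = rng.choice(N_RULES, p=rule_probs)    # power-law-distributed rule choice
    children = rules[(level, symbol)][k]
    return [tok for child in children for tok in expand(int(child), level + 1)]

# One sample: the root symbol plays the role of the label, the leaves of the input.
label = int(rng.integers(0, VOCAB))
leaves = expand(label)
print(label, leaves)   # a length-BRANCH**DEPTH token sequence generated from `label`
```

In this picture, classification amounts to recovering the root symbol from the leaves, while next-token prediction amounts to predicting a leaf token from the preceding ones.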

📝 Abstract
Recent theories suggest that Neural Scaling Laws arise whenever the task is linearly decomposed into power-law distributed units. Alternatively, scaling laws also emerge when data exhibit a hierarchically compositional structure, as is thought to occur in language and images. To unify these views, we consider classification and next-token prediction tasks based on probabilistic context-free grammars -- probabilistic models that generate data via a hierarchy of production rules. For classification, we show that having power-law distributed production rules results in a power-law learning curve with an exponent depending on the rules' distribution and a large multiplicative constant that depends on the hierarchical structure. By contrast, for next-token prediction, the distribution of production rules controls the local details of the learning curve, but not the exponent describing the large-scale behaviour.
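Schematically, the classification result can be written as follows (the notation is illustrative, not taken from the paper):

```latex
% \varepsilon(P): test error after training on P examples
% \beta:          exponent set by the power-law distribution of production rules
% C(L):           multiplicative constant set by the hierarchical structure,
%                 growing exponentially with the depth L (per the summary above)
\[
  \varepsilon(P) \;\simeq\; C(L)\, P^{-\beta},
  \qquad
  C(L) \;\sim\; e^{c L} \ \text{for some constant } c > 0 .
\]
```

For next-token prediction, by contrast, the abstract states that the rule distribution affects only the local details of the learning curve, not the exponent of its large-scale decay.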
Problem

Research questions and friction points this paper is trying to address.

Unify theories on Neural Scaling Laws origins
Analyze learning curves for hierarchical compositional data
Study power-law impact on classification and prediction tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Unifies power-law distributed features with hierarchical structures
Uses probabilistic context-free grammars for data generation
Analyzes learning curves for classification and token prediction
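As a concrete illustration of the last point, here is a minimal sketch (not from the paper) of how a learning-curve exponent can be read off from measurements of test error versus training-set size on log-log axes. The sample points are made-up placeholders; in practice they would come from training runs at different training-set sizes P.

```python
# Minimal sketch, not from the paper: estimating a learning-curve exponent beta
# by fitting error ≈ C * P**(-beta) on log-log axes. The arrays below are
# made-up placeholders standing in for measured results.
import numpy as np

P = np.array([1e2, 1e3, 1e4, 1e5])        # training-set sizes (placeholders)
err = np.array([0.5, 0.2, 0.08, 0.03])    # measured test errors (placeholders)

# Linear fit in log-log coordinates: log(err) ≈ -beta * log(P) + log(C)
slope, intercept = np.polyfit(np.log(P), np.log(err), deg=1)
beta = -slope             # learning-curve exponent
C = np.exp(intercept)     # multiplicative constant

print(f"error ≈ {C:.3g} * P^(-{beta:.2f})")
```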
Francesco Cagnetta
Theoretical and Scientific Data Science, SISSA, Trieste, Italy
Hyunmo Kang
Seoul National University
physics, machine learning
Matthieu Wyart
Professor of Physics, Johns Hopkins
physics