🤖 AI Summary
This work addresses the theoretical foundations of neural scaling laws for hierarchically compositional data (e.g., language, images), seeking a unified explanation for their emergence from the interplay between power-law-distributed features and hierarchical structure.
Method: We model hierarchical generative processes using probabilistic context-free grammars (PCFGs), combining theoretical learning-curve analysis with asymptotic complexity derivations.
Contribution/Results: We provide the first unified account linking power-law feature distributions and hierarchical compositionality. For classification, learning curves follow a power-law decay whose exponent is determined by the power-law exponent of the production-rule probabilities, while the multiplicative constant scales exponentially with hierarchy depth. For next-token prediction, the large-scale scaling exponent decouples from the rule distribution and depends only on the global grammar structure. These results reveal task-dependent mechanistic origins of scaling behaviour, establishing a coherent theoretical framework for understanding neural generalization and scaling across modalities.
📝 Abstract
Recent theories suggest that Neural Scaling Laws arise whenever the task can be linearly decomposed into power-law-distributed units. Scaling laws also emerge when data exhibit a hierarchically compositional structure, as is thought to occur in language and images. To unify these views, we consider classification and next-token prediction tasks based on probabilistic context-free grammars -- probabilistic models that generate data via a hierarchy of production rules. For classification, we show that power-law-distributed production rules result in a power-law learning curve with an exponent depending on the rules' distribution and a large multiplicative constant that depends on the hierarchical structure. By contrast, for next-token prediction, the distribution of production rules controls the local details of the learning curve, but not the exponent describing the large-scale behaviour.
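To make the generative setup concrete, here is a minimal sketch of sampling from a depth-L hierarchy in which every symbol expands into two children and production rules are chosen with power-law (Zipf-like) probabilities over rule ranks. All names and parameters (`vocab_size`, `n_rules`, `alpha`, the child mapping) are illustrative assumptions, not the paper's actual grammar:

```python
import random

def zipf_weights(n_rules, alpha):
    """Normalised power-law weights p_k ∝ k^(-alpha) over rule ranks 1..n_rules."""
    weights = [k ** -alpha for k in range(1, n_rules + 1)]
    total = sum(weights)
    return [w / total for w in weights]

def sample_leaves(depth, vocab_size, n_rules, alpha, rng):
    """Generate a leaf string by recursively applying a power-law-ranked
    production rule at each level; each rule maps one symbol to two children."""
    probs = zipf_weights(n_rules, alpha)

    def expand(symbol, level):
        if level == depth:
            return [symbol]
        # pick a production rule with power-law probability over its rank
        rule = rng.choices(range(n_rules), weights=probs)[0]
        # hypothetical deterministic children given (symbol, rule)
        left = (symbol * n_rules + rule) % vocab_size
        right = (symbol + rule + 1) % vocab_size
        return expand(left, level + 1) + expand(right, level + 1)

    return expand(0, 0)

rng = random.Random(0)
leaves = sample_leaves(depth=3, vocab_size=16, n_rules=8, alpha=1.5, rng=rng)
print(len(leaves))  # a depth-3 binary hierarchy yields 2**3 = 8 leaves
```

The key point of the setup is visible here: the number of leaves (input size) grows exponentially with depth, while the choice of rule at each node follows the power-law distribution whose exponent, per the abstract, governs the classification learning-curve exponent.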