Theoretical Foundations of Scaling Law in Familial Models

πŸ“… 2025-12-29
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
Existing neural scaling laws apply only to monolithic dense models and fail to characterize "familial models": a paradigm comprising multiple sub-models derived from a shared backbone, supporting early-exit and relay-style inference, and enabling heterogeneous deployment across device, edge, and cloud tiers. Method: We introduce granularity (G) as a third fundamental scaling variable alongside model size (N) and training token count (D), establishing a unified scaling law L(N, D, G). Using an IsoFLOP experimental design and multivariate parameterization, we empirically identify a multiplicative power-law granularity penalty with an extremely small exponent. Results: We demonstrate that N, G, and D are decoupled and independently scalable under fixed compute budgets. Familial models thus preserve compute optimality while substantially enhancing deployment flexibility. This work provides the first empirically validated scaling-theory framework for ubiquitous intelligence.
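The summary gives only the shape of the law, not its parameterization. As a hedged illustration, one plausible reading of a unified form L(N, D, G) assumes a Chinchilla-style base loss multiplied by a power-law granularity penalty; the constants E, A, B, Ξ±, Ξ², Ξ³ below are hypothetical, not values reported by the paper.

```latex
% Illustrative form only: Chinchilla-style base loss times a power-law
% granularity penalty. E, A, B, \alpha, \beta, \gamma are hypothetical
% fitting constants, not values from the paper.
L(N, D, G) \;=\; \Bigl( E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}} \Bigr)\, G^{\gamma},
\qquad 0 < \gamma \ll 1 .
```

Under this reading, an "extremely small exponent" Ξ³ means that increasing the number of deployable sub-models G raises the loss only marginally at any fixed (N, D).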

πŸ“ Abstract
Neural scaling laws have become foundational for optimizing large language model (LLM) training, yet they typically assume a single dense model output. This limitation effectively overlooks "familial models," a transformative paradigm essential for realizing ubiquitous intelligence across heterogeneous device-edge-cloud hierarchies. Transcending static architectures, familial models integrate early exits with relay-style inference to spawn G deployable sub-models from a single shared backbone. In this work, we theoretically and empirically extend the scaling law to capture this "one-run, many-models" paradigm by introducing Granularity (G) as a fundamental scaling variable alongside model size (N) and training tokens (D). To rigorously quantify this relationship, we propose a unified functional form L(N, D, G) and parameterize it using large-scale empirical runs. Specifically, we employ a rigorous IsoFLOP experimental design to strictly isolate architectural impact from computational scale. Across fixed budgets, we systematically sweep model sizes (N) and granularities (G) while dynamically adjusting tokens (D). This approach effectively decouples the marginal cost of granularity from the benefits of scale, ensuring high-fidelity parameterization of our unified scaling law. Our results reveal that the granularity penalty follows a multiplicative power law with an extremely small exponent. Theoretically, this bridges fixed-compute training with dynamic architectures. Practically, it validates the "train once, deploy many" paradigm, demonstrating that deployment flexibility is achievable without compromising the compute-optimality of dense baselines.
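A rough sketch of the IsoFLOP sweep described in the abstract follows. This is not the authors' code: the C β‰ˆ 6Β·NΒ·D compute approximation, the assumption that granularity adds negligible training compute, and the grid values are all illustrative assumptions.

```python
# Sketch of an IsoFLOP sweep, assuming the common C ~= 6*N*D compute
# approximation; grid values are illustrative, not the paper's settings.

def isoflop_grid(budgets, model_sizes, granularities):
    """Yield (C, N, G, D) with D chosen so total training compute stays at C."""
    for C in budgets:
        for N in model_sizes:
            D = C / (6 * N)          # tokens implied by the fixed budget
            for G in granularities:  # granularity varies the architecture, not C
                yield C, N, G, D

if __name__ == "__main__":
    budgets = [1e19, 1e20]           # training FLOPs (illustrative)
    model_sizes = [1e8, 3e8, 1e9]    # parameters (illustrative)
    granularities = [1, 2, 4, 8]     # deployable sub-models per backbone
    for C, N, G, D in isoflop_grid(budgets, model_sizes, granularities):
        print(f"C={C:.0e}  N={N:.0e}  G={G}  D={D:.2e}")
```

Each fixed budget thus yields a grid of (N, G) runs whose token counts D are backed out from the budget, which is how the sweep can separate the marginal effect of granularity from the effect of scale.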
Problem

Research questions and friction points this paper is trying to address.

Extend scaling laws to familial models with granularity as a variable
Quantify the relationship between model size, tokens, and granularity
Validate train-once-deploy-many paradigm without sacrificing compute efficiency
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces granularity G as a scaling variable alongside N and D
Employs IsoFLOP design to isolate architectural impact from compute
Reveals that the granularity penalty follows a multiplicative power law with a small exponent (see the fitting sketch below)
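A minimal sketch of how such a multiplicative power law could be fitted from IsoFLOP measurements follows. The functional form (the illustrative L(N, D, G) shown earlier), the initial guesses, the use of nonlinear least squares, and the synthetic data are assumptions, not the paper's reported procedure or results.

```python
# Minimal sketch: fit L(N, D, G) = (E + A*N**-alpha + B*D**-beta) * G**gamma
# by nonlinear least squares. Form, starting values, and synthetic data are
# illustrative assumptions, not the paper's parameterization.
import numpy as np
from scipy.optimize import curve_fit

def familial_loss(X, E, A, alpha, B, beta, gamma):
    N, D, G = X
    return (E + A * N**-alpha + B * D**-beta) * G**gamma

def fit_scaling_law(N, D, G, L):
    """Return fitted (E, A, alpha, B, beta, gamma) from per-run arrays."""
    p0 = [1.5, 400.0, 0.3, 400.0, 0.3, 0.01]   # rough Chinchilla-like start
    params, _ = curve_fit(familial_loss, (N, D, G), L, p0=p0, maxfev=20000)
    return params

if __name__ == "__main__":
    # Synthetic demo only: generate runs from known constants and recover them.
    rng = np.random.default_rng(0)
    N = rng.uniform(1e8, 1e9, 64)
    D = rng.uniform(1e9, 1e10, 64)
    G = rng.integers(1, 9, 64).astype(float)
    true = (1.7, 400.0, 0.32, 410.0, 0.29, 0.015)
    L = familial_loss((N, D, G), *true) * np.exp(rng.normal(0.0, 0.005, 64))
    print(fit_scaling_law(N, D, G, L))
```

In this reading, the fitted gamma being close to zero is what the summary calls an "extremely small exponent": the penalty for spawning more sub-models is nearly flat in G.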
Huan Song
Amazon AWS AI
Deep learning, machine learning, graph neural networks, time-series analysis
Qingfei Zhao
University of the Chinese Academy of Sciences
Natural Language Processing, Artificial Intelligence
Ting Long
Institute of Artificial Intelligence (TeleAI), China Telecom
Shuyu Tian
Institute of Artificial Intelligence (TeleAI), China Telecom
Hongjun An
Institute of Artificial Intelligence (TeleAI), China Telecom
Jiawei Shao
Institute of Artificial Intelligence (TeleAI), China Telecom
Chi Zhang
Institute of Artificial Intelligence (TeleAI), China Telecom
Xuelong Li
Institute of Artificial Intelligence (TeleAI), China Telecom