Gemstones: A Model Suite for Multi-Faceted Scaling Laws

📅 2025-02-07
📈 Citations: 0
Influential: 0
🤖 AI Summary
Standard scaling laws exhibit significant prediction errors under varying architectural configurations (width, depth, MLP ratio) and training hyperparameters (learning rate and cooldown schedule), yet prior work fits laws to single model families with frozen hyperparameters. Method: The authors construct a training framework over Transformer architectures that decouples architectural and optimization variables. They introduce Gemstones, the most comprehensive open-source scaling law dataset to date: 4,000+ checkpoints from models of up to 2B parameters, systematically covering joint width-depth variations alongside diverse optimizer configurations. Contribution/Results: Empirical analysis shows that scaling-law prescriptions are highly sensitive to hyperparameter settings, with experimental design exerting greater influence on outcomes than previously assumed. The suite enables rigorous coupled width-depth modeling and other scaling analyses not possible with prior single-family datasets, delivers methodological cautions against overgeneralization, and provides open-source infrastructure for robust scaling-law research.

📝 Abstract
Scaling laws are typically fit using a family of models with a narrow range of frozen hyper-parameter choices. In this work we study scaling laws using a wide range of architecture and hyper-parameter choices, and highlight their impact on resulting prescriptions. As a primary artifact of our research, we release the Gemstones: the most comprehensive open-source scaling law dataset to date, consisting of over 4000 checkpoints from transformers with up to 2 billion parameters; these models have been trained with different learning rates, cooldown schedules, and architectural shapes. Our checkpoints enable more complex studies of scaling, such as a law that predicts language modeling performance as a function of model width and depth. By examining the various facets of our model suite, we find that the prescriptions of scaling laws can be highly sensitive to the experimental design process and the specific model checkpoints used during fitting. Code: https://github.com/mcleish7/gemstone-scaling-laws
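The abstract mentions a law that predicts language modeling performance as a function of model width and depth. As an illustrative sketch only, the snippet below fits a simple parametric form of that kind (a Chinchilla-style sum of power laws in width and depth, plus an irreducible term) to toy data with `scipy.optimize.curve_fit`. The functional form, variable names, and coefficients are assumptions for demonstration, not the paper's actual law; see the linked repository for the real fitting code.

```python
import numpy as np
from scipy.optimize import curve_fit

# Hypothetical functional form: loss = E + A / width^alpha + B / depth^beta.
# This is an assumed shape for illustration, not the paper's exact law.
def scaling_law(X, E, A, alpha, B, beta):
    width, depth = X
    return E + A / width**alpha + B / depth**beta

# Synthetic (width, depth) configurations and noiseless toy losses
# generated from known "true" parameters E=2.0, A=50, alpha=0.6, B=3, beta=0.4.
widths = np.array([512, 768, 1024, 1536, 2048, 512, 1024, 2048], dtype=float)
depths = np.array([8, 12, 16, 24, 32, 16, 24, 8], dtype=float)
losses = 2.0 + 50.0 / widths**0.6 + 3.0 / depths**0.4

# Fit the five free parameters; bounds keep coefficients/exponents positive.
params, _ = curve_fit(
    scaling_law, (widths, depths), losses,
    p0=[2.0, 30.0, 0.5, 2.0, 0.5],
    bounds=(0.0, np.inf), maxfev=10000,
)
E, A, alpha, B, beta = params
print(f"E={E:.3f}, A={A:.1f}, alpha={alpha:.3f}, B={B:.2f}, beta={beta:.3f}")
```

Once fit, such a law can be queried at unseen (width, depth) pairs to compare architectural shapes at a fixed parameter budget, which is the kind of analysis the Gemstones checkpoints are designed to support.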
Problem

Research questions and friction points this paper is trying to address.

How sensitive are scaling-law prescriptions to architecture and hyper-parameter choices?
Prior scaling laws are fit to single model families with narrow, frozen hyper-parameters
No comprehensive open-source checkpoint suite exists for studying scaling-law fitting itself
Innovation

Methods, ideas, or system contributions that make the work stand out.

Models trained across a wide range of architectural shapes and hyper-parameters
Gemstones: 4,000+ open-source checkpoints from transformers up to 2B parameters
Scaling law that predicts language modeling loss jointly from model width and depth