Gemstones: A Model Suite for Multi-Faceted Scaling Laws

📅 2025-02-07
📈 Citations: 0
Influential: 0
🤖 AI Summary
Standard scaling laws exhibit significant prediction errors under varying architectural configurations (width, depth, MLP ratio) and training hyperparameters (learning rate and cooldown schedule), yet prior work fits laws to single model families with frozen hyperparameters. Method: The authors construct a training framework over Transformer architectures that decouples architectural and optimization variables. They introduce Gemstones, the most comprehensive open-source scaling law dataset to date: 4,000+ checkpoints from models of up to 2B parameters, systematically covering joint width-depth variations alongside diverse optimizer configurations. Contribution/Results: Empirical analysis shows that scaling-law prescriptions are highly sensitive to hyperparameter settings, with experimental design exerting greater influence on outcomes than previously assumed. The suite enables rigorous coupled width-depth modeling and other scaling analyses not possible with prior single-family datasets, delivers methodological cautions against overgeneralization, and provides open-source infrastructure for robust scaling-law research.

📝 Abstract
Scaling laws are typically fit using a family of models with a narrow range of frozen hyper-parameter choices. In this work we study scaling laws using a wide range of architecture and hyper-parameter choices, and highlight their impact on resulting prescriptions. As a primary artifact of our research, we release the Gemstones: the most comprehensive open-source scaling law dataset to date, consisting of over 4000 checkpoints from transformers with up to 2 billion parameters; these models have been trained with different learning rates, cooldown schedules, and architectural shapes. Our checkpoints enable more complex studies of scaling, such as a law that predicts language modeling performance as a function of model width and depth. By examining the various facets of our model suite, we find that the prescriptions of scaling laws can be highly sensitive to the experimental design process and the specific model checkpoints used during fitting. Code: https://github.com/mcleish7/gemstone-scaling-laws
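The abstract mentions a law that predicts language modeling performance as a function of model width and depth. As an illustrative sketch only, the snippet below fits a simple parametric form of that kind (a Chinchilla-style sum of power laws in width and depth, plus an irreducible term) to toy data with `scipy.optimize.curve_fit`. The functional form, variable names, and coefficients are assumptions for demonstration, not the paper's actual law; see the linked repository for the real fitting code.

```python
import numpy as np
from scipy.optimize import curve_fit

# Hypothetical functional form: loss = E + A / width^alpha + B / depth^beta.
# This is an assumed shape for illustration, not the paper's exact law.
def scaling_law(X, E, A, alpha, B, beta):
    width, depth = X
    return E + A / width**alpha + B / depth**beta

# Synthetic (width, depth) configurations and noiseless toy losses
# generated from known "true" parameters E=2.0, A=50, alpha=0.6, B=3, beta=0.4.
widths = np.array([512, 768, 1024, 1536, 2048, 512, 1024, 2048], dtype=float)
depths = np.array([8, 12, 16, 24, 32, 16, 24, 8], dtype=float)
losses = 2.0 + 50.0 / widths**0.6 + 3.0 / depths**0.4

# Fit the five free parameters; bounds keep coefficients/exponents positive.
params, _ = curve_fit(
    scaling_law, (widths, depths), losses,
    p0=[2.0, 30.0, 0.5, 2.0, 0.5],
    bounds=(0.0, np.inf), maxfev=10000,
)
E, A, alpha, B, beta = params
print(f"E={E:.3f}, A={A:.1f}, alpha={alpha:.3f}, B={B:.2f}, beta={beta:.3f}")
```

Once fit, such a law can be queried at unseen (width, depth) pairs to compare architectural shapes at a fixed parameter budget, which is the kind of analysis the Gemstones checkpoints are designed to support.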
Problem

Research questions and friction points this paper is trying to address.

How sensitive are scaling-law prescriptions to architecture and hyper-parameter choices?
Prior scaling laws are fit to single model families with narrow, frozen hyper-parameters
No comprehensive open-source checkpoint suite exists for studying scaling-law fitting itself
Innovation

Methods, ideas, or system contributions that make the work stand out.

Models trained across a wide range of architectural shapes and hyper-parameters
Gemstones: 4,000+ open-source checkpoints from transformers up to 2B parameters
Scaling law that predicts language modeling loss jointly from model width and depth