🤖 AI Summary
Traditional scaling laws predict performance from model and data scale alone, neglecting other hyperparameters, and thus struggle to deliver accurate predictions and efficient tuning under hardware constraints. This work proposes Configuration-to-Performance Scaling Laws (CPL), which for the first time incorporate the full training configuration into the scaling-law framework. By parameterizing this mapping with a large language model, the authors obtain a neuralized CPL (NCPL). Trained on diverse open-source pretraining logs, NCPL supports joint optimization over multiple hyperparameters, predicts final pretraining loss with 20–40% lower error than Chinchilla scaling laws, generalizes to runs using up to ten times the maximum compute budget observed in the training set, matches hyperparameter scaling-law baselines in multi-hyperparameter tuning tasks, and extends naturally to richer targets such as loss-curve prediction.
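The contrast between a configuration-agnostic scaling law and a configuration-to-performance mapping can be sketched in a few lines. The Chinchilla constants below are the published Hoffmann et al. fit and are illustrative only; `toy_predictor` is a hypothetical stand-in for the paper's LLM-parameterized NCPL, not its actual model.

```python
import math

def chinchilla_loss(N, D, E=1.69, A=406.4, B=410.7, alpha=0.34, beta=0.28):
    """Configuration-agnostic law: loss depends only on model size N
    and data size D (published Chinchilla fit, for illustration)."""
    return E + A / N**alpha + B / D**beta

def cpl_loss(config, predictor):
    """Configuration-to-Performance mapping: the *full* training
    configuration (N, D, learning rate, batch size, schedule, ...)
    maps to predicted loss. `predictor` stands in for NCPL; this
    interface is a sketch, not the paper's implementation."""
    return predictor(config)

def toy_predictor(config):
    """Toy example: Chinchilla baseline plus a penalty for a
    suboptimal learning rate (purely illustrative)."""
    base = chinchilla_loss(config["N"], config["D"])
    lr_penalty = 0.1 * abs(math.log10(config["lr"] / 3e-4))
    return base + lr_penalty
```

The point of the CPL formulation is that the second signature accepts hyperparameters the first cannot see, so mis-set values (e.g. a learning rate forced by hardware constraints) change the predicted loss instead of violating the law's optimality assumption.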
📝 Abstract
Researchers build scaling laws to forecast the training performance of expensive large-scale runs with larger model size N and data size D. These laws assume that other training hyperparameters are optimally chosen, which can require significant effort and, in some cases, be impossible due to external hardware constraints. To improve predictability across a broader set of hyperparameters and enable simpler tuning at scale, we propose learning a *Configuration-to-Performance Scaling Law* (CPL): a mapping from the *full training configuration* to training performance. Because no simple functional form can express this mapping, we parameterize it with a large language model (LLM), and fit it with diverse open-source pretraining logs across multiple sources, yielding a *Neural* Configuration-to-Performance Scaling Law (NCPL). NCPL accurately predicts how training configurations influence the final pretraining loss, achieving 20–40% lower prediction error than the configuration-agnostic Chinchilla law and generalizing to runs using up to 10× more compute than any run in the training set. It further supports joint tuning of multiple hyperparameters with performance comparable to hyperparameter scaling law baselines. Finally, NCPL naturally and effectively extends to richer prediction targets such as loss-curve prediction.
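In equations, the configuration-agnostic baseline referenced above is the Chinchilla parametric law, while CPL conditions on the whole configuration. The Chinchilla form is the standard published one; the CPL notation below is an illustrative sketch, with c denoting the full configuration and f_θ the learned (here, LLM-parameterized) predictor:

```latex
% Configuration-agnostic (Chinchilla): loss from N and D only,
% assuming all other hyperparameters are tuned optimally.
L(N, D) = E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}}

% Configuration-to-Performance (CPL): loss from the full configuration.
L = f_{\theta}(c), \qquad
c = (N, D, \text{learning rate}, \text{batch size}, \text{schedule}, \dots)
```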