🤖 AI Summary
Existing LLM scaling laws generalize poorly across model families because training paradigms and data processing differ, and fitting a separate law per family requires costly repeated training. This paper introduces Skills Scaling Laws (SSLaws, pronounced Sloth), a framework that attributes LLM performance to transferable, low-dimensional latent skills, such as reasoning and instruction following, whose dependence on computational resources varies across families. Sloth exploits correlations across benchmarks to identify these latent skills, mapping models into a low-dimensional skill space and estimating cross-family scaling parameters solely from publicly available benchmark data. Evaluated on 12 benchmarks from the Open LLM Leaderboard v1/v2, Sloth significantly improves cross-family performance prediction accuracy while remaining interpretable, and it uncovers task-specific scaling behaviors, for example distinct patterns for coding versus emotional intelligence.
📝 Abstract
Scaling laws for large language models (LLMs) predict model performance based on parameters such as size and training data. However, differences in training configurations and data processing across model families lead to significant variations in benchmark performance, making it difficult for a single scaling law to generalize across all LLMs. Conversely, fitting family-specific scaling laws requires training models of varying sizes for every family. In this work, we propose Skills Scaling Laws (SSLaws, pronounced Sloth), a novel scaling law that leverages publicly available benchmark data and assumes LLM performance is driven by low-dimensional latent skills, such as reasoning and instruction following. These latent skills are influenced by computational resources like model size and training tokens, but with varying efficiencies across model families. Sloth exploits correlations across benchmarks to provide more accurate and interpretable predictions while alleviating the need to train multiple LLMs per family. We present both theoretical results on parameter identification and empirical evaluations on 12 prominent benchmarks from Open LLM Leaderboard v1/v2, demonstrating that Sloth predicts LLM performance efficiently and offers insights into scaling behaviors for downstream tasks such as coding and emotional intelligence applications.
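The latent-skill idea described above can be sketched numerically. The toy below is an illustration, not the paper's estimator: it generates synthetic benchmark scores from two latent skills that depend linearly on log model size and log training tokens with family-specific intercepts, recovers a low-dimensional skill space from the score matrix via truncated SVD (a PCA-style stand-in for Sloth's identification procedure), and fits the recovered skills on compute features with a per-family term. All dimensions, coefficients, and noise levels are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic setup (illustrative, not the paper's data):
# 8 models from 2 families, 12 benchmarks, 2 latent skills.
n_models, n_bench, n_skills = 8, 12, 2
log_size = rng.uniform(20, 26, n_models)      # log(parameter count)
log_tokens = rng.uniform(25, 29, n_models)    # log(training tokens)
family = np.repeat([0, 1], n_models // 2)     # family index per model

# Skills depend on compute, with a family-specific efficiency shift.
slope = np.array([[0.30, 0.10], [0.05, 0.25]])      # skills x resources
intercept = np.array([[-6.0, -3.0], [-5.0, -4.0]])  # family x skills
skills = np.c_[log_size, log_tokens] @ slope.T + intercept[family]

# Benchmark scores load linearly on the shared latent skills.
loadings = rng.normal(size=(n_skills, n_bench))
scores = skills @ loadings + 0.05 * rng.normal(size=(n_models, n_bench))

# Recover a low-dimensional skill space from benchmark scores alone.
centered = scores - scores.mean(axis=0)
U, S, Vt = np.linalg.svd(centered, full_matrices=False)
latent = U[:, :n_skills] * S[:n_skills]             # estimated skills

# Fit skill ~ compute with a per-family intercept (least squares).
X = np.c_[log_size, log_tokens,
          (family == 1).astype(float), np.ones(n_models)]
coef, *_ = np.linalg.lstsq(X, latent, rcond=None)

# Predict benchmark scores through the fitted skill space.
skill_to_bench, *_ = np.linalg.lstsq(latent, centered, rcond=None)
pred = X @ coef @ skill_to_bench
r2 = 1 - ((centered - pred) ** 2).sum() / (centered ** 2).sum()
print(f"variance explained by 2 skills + compute: {r2:.2f}")
```

Because the benchmarks are correlated through only two skills, a handful of models per family is enough to pin down the compute-to-skill mapping, which is the efficiency the paper's approach exploits over fitting each benchmark and family independently.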