🤖 AI Summary
Traditional scaling laws based on validation perplexity struggle to predict downstream task performance accurately as model scale increases, largely because they neglect task-specific characteristics and rely on predefined functional forms. This work reframes scaling-law modeling as a time-series extrapolation problem and introduces the first end-to-end temporal neural network architecture that jointly leverages historical task accuracy and fine-grained per-token validation loss. By learning multi-task scaling behaviors directly from data, the proposed model eliminates the need for fixed parametric assumptions. Evaluated across 66 downstream tasks, it achieves an average absolute error of 2.04%, a 38% improvement over logistic scaling laws, and demonstrates strong zero-shot generalization to unseen model families, scales, and tasks.
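For context, the parametric baseline mentioned above typically fits a logistic (sigmoid) curve of task accuracy against log-compute. A minimal sketch of that functional form; the parameter names and values here are illustrative assumptions, not fitted values from the paper:

```python
import math

def logistic_scaling_law(log_compute, ceiling=0.9, slope=1.5, midpoint=20.0):
    """Logistic scaling-law baseline: accuracy as a sigmoid in log-compute.

    ceiling:  asymptotic accuracy as compute grows (assumed value).
    slope:    steepness of the transition (assumed value).
    midpoint: log-compute at which accuracy is half the ceiling (assumed).
    """
    return ceiling / (1.0 + math.exp(-slope * (log_compute - midpoint)))
```

A fixed form like this can only describe monotone, saturating curves, which is exactly the limitation the proposed data-driven approach is meant to avoid.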
📝 Abstract
Neural scaling laws predict how language model performance improves with increased compute. While aggregate metrics like validation loss can follow smooth power-law curves, individual downstream tasks exhibit diverse scaling behaviors: some improve monotonically, others plateau, and some even degrade with scale. We argue that predicting downstream performance from validation perplexity suffers from two limitations: averaging token-level losses obscures signal, and no simple parametric family can capture the full spectrum of scaling behaviors. To address this, we propose Neural Neural Scaling Laws (NeuNeu), a neural network that frames scaling law prediction as time-series extrapolation. NeuNeu combines temporal context from observed accuracy trajectories with token-level validation losses, learning to predict future performance without assuming any bottleneck or functional form. Trained entirely on open-source model checkpoints from HuggingFace, NeuNeu achieves 2.04% mean absolute error in predicting model accuracy on 66 downstream tasks -- a 38% reduction compared to logistic scaling laws (3.29% MAE). Furthermore, NeuNeu generalizes zero-shot to unseen model families, parameter counts, and downstream tasks. Our work suggests that predicting downstream scaling laws directly from data outperforms parametric alternatives.
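The time-series framing described in the abstract can be sketched as a data-preparation step: each training example pairs a window of observed checkpoint accuracies and token-level loss summaries with the accuracy at a later checkpoint. All names, shapes, and the window size below are illustrative assumptions, not NeuNeu's actual architecture or data format:

```python
def make_extrapolation_examples(accuracies, token_losses, context_len=4):
    """Turn a checkpoint history into (context, target) pairs.

    accuracies:   per-checkpoint task accuracy along a training run.
    token_losses: per-checkpoint vectors of token-level validation loss,
                  kept as vectors rather than averaged into one scalar.
    context_len:  how many past checkpoints form the input window (assumed).
    """
    examples = []
    for t in range(context_len, len(accuracies)):
        context = {
            "accuracy_history": accuracies[t - context_len:t],
            "token_loss_history": token_losses[t - context_len:t],
        }
        target = accuracies[t]  # future accuracy the model must predict
        examples.append((context, target))
    return examples
```

A learned sequence model trained on many such runs can then extrapolate a new model's trajectory without committing to any fixed parametric family.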