🤖 AI Summary
Traditional scaling laws based on validation perplexity struggle to predict downstream task performance accurately as model scale increases, largely because they neglect task-specific characteristics and rely on predefined functional forms. This work reframes scaling-law modeling as a time-series extrapolation problem and introduces the first end-to-end temporal neural network architecture that jointly leverages historical task accuracy and fine-grained per-token validation loss. By learning multi-task scaling behaviors directly from data, the proposed model eliminates the need for fixed parametric assumptions. Evaluated across 66 downstream tasks, it achieves an average absolute error of 2.04%, a 38% improvement over logistic scaling laws, and demonstrates strong zero-shot generalization to unseen model families, scales, and tasks.
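For context, the parametric baseline mentioned above typically fits a logistic (sigmoid) curve of task accuracy against log-compute. A minimal sketch of that functional form; the parameter names and values here are illustrative assumptions, not fitted values from the paper:

```python
import math

def logistic_scaling_law(log_compute, ceiling=0.9, slope=1.5, midpoint=20.0):
    """Logistic scaling-law baseline: accuracy as a sigmoid in log-compute.

    ceiling:  asymptotic accuracy as compute grows (assumed value).
    slope:    steepness of the transition (assumed value).
    midpoint: log-compute at which accuracy is half the ceiling (assumed).
    """
    return ceiling / (1.0 + math.exp(-slope * (log_compute - midpoint)))
```

A fixed form like this can only describe monotone, saturating curves, which is exactly the limitation the proposed data-driven approach is meant to avoid.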
📝 Abstract
Neural scaling laws predict how language model performance improves with increased compute. While aggregate metrics like validation loss can follow smooth power-law curves, individual downstream tasks exhibit diverse scaling behaviors: some improve monotonically, others plateau, and some even degrade with scale. We argue that predicting downstream performance from validation perplexity suffers from two limitations: averaging token-level losses obscures signal, and no simple parametric family can capture the full spectrum of scaling behaviors. To address this, we propose Neural Neural Scaling Laws (NeuNeu), a neural network that frames scaling law prediction as time-series extrapolation. NeuNeu combines temporal context from observed accuracy trajectories with token-level validation losses, learning to predict future performance without assuming any bottleneck or functional form. Trained entirely on open-source model checkpoints from HuggingFace, NeuNeu achieves 2.04% mean absolute error in predicting model accuracy on 66 downstream tasks -- a 38% reduction compared to logistic scaling laws (3.29% MAE). Furthermore, NeuNeu generalizes zero-shot to unseen model families, parameter counts, and downstream tasks. Our work suggests that predicting downstream scaling laws directly from data outperforms parametric alternatives.
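The time-series framing described in the abstract can be sketched as a data-preparation step: each training example pairs a window of observed checkpoint accuracies and token-level loss summaries with the accuracy at a later checkpoint. All names, shapes, and the window size below are illustrative assumptions, not NeuNeu's actual architecture or data format:

```python
def make_extrapolation_examples(accuracies, token_losses, context_len=4):
    """Turn a checkpoint history into (context, target) pairs.

    accuracies:   per-checkpoint task accuracy along a training run.
    token_losses: per-checkpoint vectors of token-level validation loss,
                  kept as vectors rather than averaged into one scalar.
    context_len:  how many past checkpoints form the input window (assumed).
    """
    examples = []
    for t in range(context_len, len(accuracies)):
        context = {
            "accuracy_history": accuracies[t - context_len:t],
            "token_loss_history": token_losses[t - context_len:t],
        }
        target = accuracies[t]  # future accuracy the model must predict
        examples.append((context, target))
    return examples
```

A learned sequence model trained on many such runs can then extrapolate a new model's trajectory without committing to any fixed parametric family.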