🤖 AI Summary
This work addresses the high computational cost and extensive training-data requirements of learning curve prediction for NLP models. We propose a zero-shot predictive framework that eliminates the need for auxiliary model training. Methodologically, we design a two-level hierarchical multi-task learning formulation and introduce a latent-variable multi-output Gaussian process (MOGP) to jointly capture inter-task and intra-level dependencies; active learning is further integrated to reduce predictive uncertainty. To our knowledge, this is the first approach to enable probabilistic learning curve prediction without any additional model training, thereby facilitating low-cost construction of scaling laws. We validate the framework on nanoGPT, mBART, Transformer, and M2M100 models across three small-scale NLP datasets containing up to 30 learning curves, achieving significantly lower prediction error than baselines while drastically reducing both computational overhead and annotation effort.
📝 Abstract
The prediction of learning curves for Natural Language Processing (NLP) models enables informed decision-making to meet specific performance objectives, while reducing computational overhead and lowering the costs associated with dataset acquisition and curation. In this work, we formulate the prediction task as a multi-task learning problem in which each task's data is modelled as being organized within a two-layer hierarchy. To capture the shared information and dependencies across tasks and hierarchical levels, we employ latent variable multi-output Gaussian Processes, which account for task correlations and support zero-shot prediction of learning curves (LCs). We demonstrate that this approach facilitates the development of probabilistic scaling laws at lower cost. Using an active learning strategy, we query LCs to reduce predictive uncertainty and obtain predictions close to ground-truth scaling laws. We validate our framework on three small-scale NLP datasets with up to $30$ LCs. These are obtained from nanoGPT models, from bilingual translation using mBART and Transformer models, and from multilingual translation using M2M100 models of varying sizes.
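To make the core idea concrete, here is a minimal sketch of probabilistic learning-curve extrapolation with a Gaussian process, plus a variance-based active-learning query. This is an illustrative simplification, not the paper's method: instead of a latent-variable hierarchical MOGP, the task identity is crudely encoded as an extra input dimension so that a single GP shares information across tasks, and the synthetic power-law curves, task parameters, and library choice (scikit-learn) are all assumptions for the example.

```python
# Hypothetical sketch: GP-based learning-curve prediction with an
# uncertainty-driven active-learning query. NOT the paper's latent-variable
# MOGP; task identity is a plain input feature, and the curves are synthetic.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

# Synthetic learning curves: loss(n) = a * n^(-b) + c, one per "task".
# The (a, b, c) values below are made up for illustration.
def curve(n, a, b, c):
    return a * n ** (-b) + c

sizes = np.logspace(2, 5, 12)                 # training-set sizes, log-spaced
tasks = {0: (5.0, 0.30, 0.9), 1: (6.0, 0.25, 1.1)}

# Observe only the cheap, small-n points of each curve.
X, y = [], []
for t, (a, b, c) in tasks.items():
    for n in sizes[:6]:
        X.append([np.log10(n), t])            # input: (log size, task id)
        y.append(curve(n, a, b, c))
X, y = np.array(X), np.array(y)

# Anisotropic RBF over (log size, task id) + observation noise.
kernel = RBF(length_scale=[1.0, 1.0]) + WhiteKernel(1e-3)
gp = GaussianProcessRegressor(kernel=kernel, normalize_y=True).fit(X, y)

# Extrapolate task 0 to the unobserved, expensive large-n points,
# with predictive uncertainty.
X_new = np.array([[np.log10(n), 0] for n in sizes[6:]])
mean, std = gp.predict(X_new, return_std=True)

# Active-learning step: query the candidate with the largest predictive std,
# i.e. the learning-curve point whose measurement reduces uncertainty most.
query = X_new[np.argmax(std)]
print("predictive mean:", mean.round(3))
print("predictive std: ", std.round(3))
print("next point to query (log10 size, task):", query)
```

In the paper's setting the single GP would be replaced by a hierarchical multi-output GP whose latent variables tie the tasks together, enabling zero-shot prediction for a curve with no observations at all; the variance-maximizing query rule above is the generic form of the active-learning idea.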