PTPP-Aware Adaptation Scaling Laws: Predicting Domain-Adaptation Performance at Unseen Pre-Training Budgets

📅 2025-10-27

📈 Citations: 0

✨ Influential: 0

career value

203K/year

🤖 AI Summary

Existing continuous pretraining (CPT) scaling laws assume a fixed pretraining budget and fail to generalize across varying tokens-per-parameter (PTPP) regimes, hindering accurate prediction of domain-adaptive performance under diverse resource constraints. Method: We propose a PTPP-aware scaling law that explicitly incorporates PTPP as a core variable, enabling cross-stage prediction of target-domain loss and principled planning of replay ratio and adaptation budget under resource constraints. Within a multilingual CPT framework, we quantify early-data predictive power using Huber-on-log loss, relative MAE, and calibration slope. Contribution/Results: Our method achieves accurate out-of-distribution prediction—e.g., models trained at PTPP = 15 or 31 reliably forecast target loss at PTPP = 279—significantly outperforming PTPP-agnostic baselines. It attains state-of-the-art performance across multiple evaluation metrics, enabling robust, resource-aware CPT scheduling and domain adaptation.

Technology Category

Application Category

📝 Abstract

Continual pre-training (CPT) for domain adaptation must balance target-domain gains with stability on the base domain. Existing CPT scaling laws typically assume a fixed pre-training budget, which limits their ability to forecast adaptation outcomes for models trained at different tokens-per-parameter (PTPP). We present emph{PTPP-aware} adaptation scaling laws that make the pre-training budget an explicit variable, enabling accurate emph{prediction} of adaptation loss at unseen ptpp. On a multilingual setup (English/Arabic $ ightarrow$ French), PTPP-aware formulations trained on early stages (ptpp{}={15,31}) predict target loss at ptpp{}=279 and outperform a PTPP-agnostic dcpt{} transfer baseline on metrics (Huber-on-log, MAE$_mathrm{rel}$, calibration slope); full diagnostics (RMSE, MAPE) are in the appendix. Beyond forecasting, we show a practical use case: planning replay ratios and adaptation token budgets that satisfy target and forgetting constraints under compute limits.

Problem

Research questions and friction points this paper is trying to address.

Predicting domain adaptation performance at unseen pre-training budgets

Balancing target domain gains with base domain stability

Planning replay ratios and token budgets under compute limits

Innovation

Methods, ideas, or system contributions that make the work stand out.

PTPP-aware scaling laws predict adaptation loss

Explicit pre-training budget variable enables forecasting

Optimizes replay ratios under computational constraints

🔎 Similar Papers

Investigating Continual Pretraining in Large Language Models: Insights and Implications