🤖 AI Summary
Existing continuous pretraining (CPT) scaling laws assume a fixed pretraining budget and fail to generalize across varying tokens-per-parameter (PTPP) regimes, hindering accurate prediction of domain-adaptive performance under diverse resource constraints.
Method: We propose a PTPP-aware scaling law that explicitly incorporates PTPP as a core variable, enabling cross-stage prediction of target-domain loss and principled planning of replay ratio and adaptation budget under resource constraints. Within a multilingual CPT framework, we quantify early-data predictive power using Huber-on-log loss, relative MAE, and calibration slope.
Contribution/Results: Our method achieves accurate out-of-distribution prediction—e.g., models trained at PTPP = 15 or 31 reliably forecast target loss at PTPP = 279—significantly outperforming PTPP-agnostic baselines. It attains state-of-the-art performance across multiple evaluation metrics, enabling robust, resource-aware CPT scheduling and domain adaptation.
📝 Abstract
Continual pre-training (CPT) for domain adaptation must balance target-domain gains with stability on the base domain. Existing CPT scaling laws typically assume a fixed pre-training budget, which limits their ability to forecast adaptation outcomes for models trained at different tokens-per-parameter (PTPP). We present emph{PTPP-aware} adaptation scaling laws that make the pre-training budget an explicit variable, enabling accurate emph{prediction} of adaptation loss at unseen ptpp. On a multilingual setup (English/Arabic $
ightarrow$ French), PTPP-aware formulations trained on early stages (ptpp{}={15,31}) predict target loss at ptpp{}=279 and outperform a PTPP-agnostic dcpt{} transfer baseline on metrics (Huber-on-log, MAE$_mathrm{rel}$, calibration slope); full diagnostics (RMSE, MAPE) are in the appendix. Beyond forecasting, we show a practical use case: planning replay ratios and adaptation token budgets that satisfy target and forgetting constraints under compute limits.