🤖 AI Summary
Understanding how distributional shifts in training data affect language models of varying scales—and whether small proxy models can reliably predict the data sensitivity of large models—remains an open challenge.
Method: We conduct 12 controlled data perturbation experiments across models at multiple scales (LLaMA, Pythia), employing predictive correlation analysis, TracIn-based data attribution, and subset selection evaluation.
Contribution/Results: We provide the first systematic empirical evidence that small and large models exhibit highly consistent responses to data perturbations (mean predictive correlation = 0.89), directly refuting the conventional assumption that model scaling undermines the transferability of data effects. This finding establishes a foundational theoretical basis for data-centric AI and substantially improves the generalization reliability of proxy models in data attribution (+23% accuracy) and optimal subset selection (+31% utility).
📝 Abstract
Choice of training data distribution greatly influences model behavior. Yet, in large-scale settings, precisely characterizing how changes in training data affect predictions is often difficult due to model training costs. Current practice is to instead extrapolate from scaled-down, inexpensive-to-train proxy models. However, changes in data do not influence smaller and larger models identically. Therefore, understanding how choice of data affects large-scale models raises the question: how does training data distribution influence model behavior across compute scale? We find that small- and large-scale language model predictions generally correlate strongly across choices of training data. Equipped with these findings, we characterize how proxy scale affects effectiveness in two downstream proxy model applications: data attribution and dataset selection.
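To make the predictive correlation analysis concrete, here is a minimal sketch of the kind of measurement the abstract describes: compute the Pearson correlation between a small proxy model's and a large model's responses to the same set of training-data perturbations. The loss deltas below are hypothetical placeholders, not results from the paper.

```python
# Sketch (hypothetical numbers): how well does a small proxy model's
# response to data perturbations predict a large model's response?
import statistics


def pearson(xs, ys):
    # Pearson correlation between two equal-length sequences.
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)


# Hypothetical eval-loss changes for the same six data perturbations,
# measured on a small proxy model and on a large target model.
proxy_deltas = [0.12, -0.05, 0.30, 0.08, -0.11, 0.21]
large_deltas = [0.10, -0.02, 0.27, 0.05, -0.09, 0.18]

r = pearson(proxy_deltas, large_deltas)
print(f"predictive correlation across perturbations: {r:.2f}")
```

A high correlation on such measurements is what would justify using the cheap proxy model to rank data interventions before committing to a full-scale training run.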