🤖 AI Summary
Understanding how distributional shifts in training data affect language models of varying scales—and whether small proxy models can reliably predict the data sensitivity of large models—remains an open challenge.
Method: We conduct 12 controlled data perturbation experiments across models at multiple scales (LLaMA, Pythia), employing predictive correlation analysis, TracIn-based data attribution, and subset selection evaluation.
Contribution/Results: We provide the first systematic empirical evidence that small and large models exhibit highly consistent responses to data perturbations (mean predictive correlation = 0.89), directly refuting the conventional assumption that model scaling undermines the transferability of data effects. This finding establishes a foundational theoretical basis for data-centric AI and substantially improves the generalization reliability of proxy models in data attribution (+23% accuracy) and optimal subset selection (+31% utility).
📝 Abstract
Choice of training data distribution greatly influences model behavior. Yet, in large-scale settings, precisely characterizing how changes in training data affect predictions is often difficult due to model training costs. Current practice is to instead extrapolate from scaled-down, inexpensive-to-train proxy models. However, changes in data do not influence smaller and larger models identically. Therefore, understanding how choice of data affects large-scale models raises the question: how does training data distribution influence model behavior across compute scale? We find that small- and large-scale language model predictions generally correlate strongly across choices of training data. Equipped with these findings, we characterize how proxy scale affects effectiveness in two downstream proxy model applications: data attribution and dataset selection.
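To make the predictive correlation analysis concrete, here is a minimal sketch of the kind of measurement the abstract describes: compute the Pearson correlation between a small proxy model's and a large model's responses to the same set of training-data perturbations. The loss deltas below are hypothetical placeholders, not results from the paper.

```python
# Sketch (hypothetical numbers): how well does a small proxy model's
# response to data perturbations predict a large model's response?
import statistics


def pearson(xs, ys):
    # Pearson correlation between two equal-length sequences.
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)


# Hypothetical eval-loss changes for the same six data perturbations,
# measured on a small proxy model and on a large target model.
proxy_deltas = [0.12, -0.05, 0.30, 0.08, -0.11, 0.21]
large_deltas = [0.10, -0.02, 0.27, 0.05, -0.09, 0.18]

r = pearson(proxy_deltas, large_deltas)
print(f"predictive correlation across perturbations: {r:.2f}")
```

A high correlation on such measurements is what would justify using the cheap proxy model to rank data interventions before committing to a full-scale training run.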