🤖 AI Summary
Conventional hand-crafted curriculum learning methods exhibit limited efficacy in language model pretraining under data-scarce regimes due to their static, non-adaptive difficulty metrics.
Method: This paper proposes a model-centric curriculum learning paradigm that replaces static, human-defined difficulty criteria with a dynamic, learnable measure of sample difficulty: training data influence, i.e., the degree to which each training instance affects the model's output. Scoring examples this way enables adaptive reordering and staged presentation of data during pretraining.
Contribution/Results: Experiments on standard benchmarks demonstrate that the proposed influence-driven curriculum significantly outperforms random-order training, yielding an improvement of over 10 percentage points on downstream tasks. This work provides empirical validation of data-influence-based curricula for language model pretraining, establishing both their effectiveness and scalability, and shifts the focus of curriculum design from fixed human heuristics to model-intrinsic signals.
📝 Abstract
Curriculum learning, a training technique where data is presented to the model in order of example difficulty (e.g., from simpler to more complex documents), has shown limited success for pre-training language models. In this work, we investigate whether curriculum learning becomes competitive if we replace conventional human-centered difficulty metrics with one that more closely corresponds to example difficulty as observed during model training. Specifically, we experiment with sorting training examples by their *training data influence*, a score which estimates the effect of individual training examples on the model's output. Models trained on our curricula are able to outperform ones trained in random order by over 10 percentage points on benchmarks, confirming that curriculum learning is beneficial for language model pre-training, as long as a more model-centric notion of difficulty is adopted.
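The core mechanic described above — scoring each training example by its influence on the model and sorting the data accordingly — can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes a TracIn-style approximation where an example's influence is the dot product of its per-example gradient with a reference (e.g., validation) gradient, and the function names and the `easy_first` direction are hypothetical choices for the sketch.

```python
import numpy as np

def influence_scores(train_grads: np.ndarray, ref_grad: np.ndarray) -> np.ndarray:
    """Approximate each training example's influence as the dot product of
    its per-example gradient with a reference gradient (TracIn-style
    first-order approximation; an assumption for this sketch)."""
    return train_grads @ ref_grad

def curriculum_order(train_grads: np.ndarray, ref_grad: np.ndarray,
                     easy_first: bool = True) -> np.ndarray:
    """Return indices of training examples sorted by influence.

    Which end of the ranking counts as 'easy' is a design choice the
    paper would fix empirically; here high influence is presented first
    when easy_first=True (assumption).
    """
    scores = influence_scores(train_grads, ref_grad)
    order = np.argsort(scores)          # ascending influence
    return order[::-1] if easy_first else order
```

In practice the per-example gradients would come from the language model itself (often projected to low dimension for tractability), and the resulting ordering would drive the staged data presentation during pretraining.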