Data-Efficient Pretraining with Group-Level Data Influence Modeling

📅 2025-02-20
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address a key limitation of instance-level data selection in data-efficient pretraining—it neglects inter-sample interactions and therefore models influence inaccurately—this paper proposes Group-MATES, a group-level data influence modeling framework. Group-MATES partitions the dataset into semantically and relationally coupled groups, estimates oracle group-level influence by locally probing the pretraining model, fits a relational data influence model that aggregates individual influences with relationship weights, and uses influence-aware clustering to jointly optimize intra-group consistency and inter-group discriminability. Subset selection is then performed by maximizing predicted group-level influence. On the DCLM benchmark, Group-MATES achieves a 10% relative core-score improvement over DCLM-Baseline and a 5% improvement over state-of-the-art instance-level influence methods, establishing a new state of the art. The results provide empirical evidence that explicitly modeling pairwise (and higher-order) interactions between data points yields significant gains in pretraining data selection.
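The core modeling idea—approximating a group's influence as a relationship-weighted aggregation of individual influences—can be sketched as below. This is a hypothetical illustration: the function name, the averaging-based weighting scheme, and the input shapes are assumptions for exposition, not the paper's actual parameterization.

```python
def group_influence(indiv, relation):
    """Toy relationship-weighted aggregation of individual influences.

    indiv:    list of per-sample influence estimates (hypothetical oracle values)
    relation: pairwise relationship weights, relation[i][j] in [0, 1]
              (e.g. derived from embedding similarity; illustrative only)
    Returns a scalar group-level influence estimate.
    """
    n = len(indiv)
    # Weight each sample by its average relationship to the whole group,
    # so strongly coupled samples contribute more to the group estimate.
    weights = [sum(relation[i]) / n for i in range(n)]
    return sum(w * v for w, v in zip(weights, indiv))
```

In the paper the weights come from a fine-tuned relational data influence model rather than a fixed similarity average; the sketch only shows the aggregation structure.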

📝 Abstract
Data-efficient pretraining has shown tremendous potential to elevate scaling laws. This paper argues that effective pretraining data should be curated at the group level, treating a set of data points as a whole rather than as independent contributors. To achieve that, we propose Group-Level Data Influence Modeling (Group-MATES), a novel data-efficient pretraining method that captures and optimizes group-level data utility. Specifically, Group-MATES collects oracle group-level influences by locally probing the pretraining model with data sets. It then fine-tunes a relational data influence model to approximate oracles as relationship-weighted aggregations of individual influences. The fine-tuned model selects the data subset by maximizing its group-level influence prediction, with influence-aware clustering to enable efficient inference. Experiments on the DCLM benchmark demonstrate that Group-MATES achieves a 10% relative core score improvement on 22 downstream tasks over DCLM-Baseline and 5% over individual-influence-based methods, establishing a new state-of-the-art. Further analyses highlight the effectiveness of relational data influence models in capturing intricate interactions between data points.
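The abstract's selection step—choosing the data subset that maximizes the model's group-level influence prediction—can be illustrated with a minimal greedy sketch. The function and its greedy strategy are assumptions for exposition; the paper pairs selection with influence-aware clustering to make inference efficient, which this sketch omits.

```python
def select_subset(candidates, k, predict_group_influence):
    """Greedy subset selection under a group-level influence predictor.

    candidates:              pool of data points (any hashable items)
    k:                       target subset size
    predict_group_influence: callable scoring a whole group (hypothetical
                             stand-in for the relational influence model)
    """
    selected = []
    remaining = list(candidates)
    while len(selected) < k and remaining:
        # Add the candidate whose inclusion yields the highest predicted
        # group-level influence for the subset as a whole.
        best = max(remaining, key=lambda c: predict_group_influence(selected + [c]))
        selected.append(best)
        remaining.remove(best)
    return selected
```

Greedy maximization is one common heuristic for this kind of set-function objective; the paper's exact optimization procedure may differ.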
Problem

Research questions and friction points this paper is trying to address.

How to move data-efficient pretraining beyond independent, instance-level selection toward group-level curation.
How to capture and optimize group-level data utility, which Group-MATES is proposed to address.
How to improve downstream task performance via relational data influence models.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Group-Level Data Influence Modeling
Relational Data Influence Model
Influence-Aware Clustering
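The influence-aware clustering contribution can be illustrated with a toy routine that clusters samples jointly on a representation feature and a scaled influence score, so that clusters are coherent in both. Everything here—the k-means-style procedure, 1-D features, and the `alpha` trade-off knob—is an illustrative assumption, not the paper's algorithm.

```python
def influence_aware_clusters(feats, influences, k, alpha=1.0, iters=10):
    """Toy k-means over (feature, alpha * influence) points.

    feats:      1-D representation feature per sample (illustrative)
    influences: per-sample influence estimates
    alpha:      hypothetical knob trading off representation vs. influence
    Returns k lists of 2-D points (possibly empty).
    """
    pts = [(f, alpha * s) for f, s in zip(feats, influences)]
    centers = pts[:k]  # naive initialization for the sketch
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in pts:
            # Assign each point to the nearest center in the joint space.
            j = min(range(k),
                    key=lambda c: (p[0] - centers[c][0]) ** 2
                                + (p[1] - centers[c][1]) ** 2)
            groups[j].append(p)
        # Recompute centers; keep the old center for an empty cluster.
        centers = [
            (sum(x for x, _ in g) / len(g), sum(y for _, y in g) / len(g))
            if g else centers[i]
            for i, g in enumerate(groups)
        ]
    return groups
```

The intent, per the abstract, is that scoring cluster-level candidates instead of arbitrary subsets makes group-influence inference tractable at pretraining scale.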