🤖 AI Summary
To address a key limitation of instance-level data selection in data-efficient pretraining—that scoring samples independently neglects inter-sample interactions and leads to inaccurate influence modeling—this paper proposes Group-MATES, a group-level data influence modeling framework. Group-MATES collects oracle group-level influences by locally probing the pretraining model with small data subsets, then fine-tunes a relational data influence model that approximates these oracles as relationship-weighted aggregations of individual influences. Subset selection is performed by maximizing the model's group-level influence prediction, with influence-aware clustering making inference efficient at scale. Evaluated on the DCLM benchmark, Group-MATES achieves a 10% relative core-score improvement over DCLM-Baseline across 22 downstream tasks and a 5% improvement over individual-influence-based methods, establishing a new state-of-the-art. The results provide empirical evidence that explicitly modeling interactions between data points yields significant gains in pretraining data selection.
📝 Abstract
Data-efficient pretraining has shown tremendous potential to elevate scaling laws. This paper argues that effective pretraining data should be curated at the group level, treating a set of data points as a whole rather than as independent contributors. To this end, we propose Group-Level Data Influence Modeling (Group-MATES), a novel data-efficient pretraining method that captures and optimizes group-level data utility. Specifically, Group-MATES collects oracle group-level influences by locally probing the pretraining model with data subsets. It then fine-tunes a relational data influence model to approximate these oracles as relationship-weighted aggregations of individual influences. The fine-tuned model selects the data subset by maximizing its group-level influence prediction, with influence-aware clustering enabling efficient inference. Experiments on the DCLM benchmark demonstrate that Group-MATES achieves a 10% relative core-score improvement on 22 downstream tasks over DCLM-Baseline and 5% over individual-influence-based methods, establishing a new state-of-the-art. Further analyses highlight the effectiveness of relational data influence models in capturing intricate interactions between data points.
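The core idea—predicting a group's influence as a relationship-weighted aggregation of individual influences, then greedily selecting the subset that maximizes the prediction—can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function names and the specific aggregation form (scaling each point's score by its mean relationship weight within the group) are assumptions for clarity.

```python
import numpy as np

def group_influence(indiv, relation):
    """Hypothetical group-level influence predictor: a
    relationship-weighted aggregation of individual influences.

    indiv:    (n,) individual influence scores for the group
    relation: (n, n) pairwise relationship weights within the group
    """
    # Scale each point's influence by its average relationship
    # weight to the rest of the group, then sum.
    weights = relation.mean(axis=1)
    return float(weights @ indiv)

def greedy_select(indiv, relation, k):
    """Greedy influence maximization: repeatedly add the candidate
    that most increases the predicted group-level influence."""
    chosen, remaining = [], set(range(len(indiv)))
    for _ in range(k):
        best = max(
            remaining,
            key=lambda j: group_influence(
                indiv[chosen + [j]],
                relation[np.ix_(chosen + [j], chosen + [j])],
            ),
        )
        chosen.append(best)
        remaining.remove(best)
    return chosen
```

In the actual method the relationship weights come from a fine-tuned relational data influence model rather than a fixed matrix, and influence-aware clustering restricts these pairwise computations to within clusters so that selection stays tractable over web-scale corpora.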