AI Summary
To address the inefficiency of foundation model training caused by high noise levels and label scarcity in internet-scale data, this paper proposes Mimic Score, a data quality metric. It uses a pre-trained reference model to automatically assess how useful each sample is for training a new model, by measuring the alignment between the gradient the sample induces on the new model's parameters and the direction pointing toward the reference model in weight space. Using gradient-direction alignment as the core criterion removes the reliance on human annotations or downstream task validation. Building on Mimic Score, the paper introduces Grad-Mimic, an automated data filtering framework. Experiments across six image datasets show that Grad-Mimic yields consistent performance gains, improves CLIP training, improves upon existing data curation methods, and gives an accurate estimate of dataset quality.
Abstract
Foundation models rely on large-scale web-crawled datasets, which frequently contain noisy data, biases, and irrelevant content. Existing data selection techniques typically use human heuristics, downstream evaluation datasets, or specialized scoring models, and can overlook samples' utility in the training process. Instead, we propose a new approach, Mimic Score, a data quality metric that uses a pretrained reference model as a guide to assess the usefulness of data samples for training a new model. It relies on the alignment between the gradient of the new model parameters and the vector pointing toward the reference model in weight space. Samples that misalign with this direction are considered low-value and can be filtered out. Motivated by the Mimic score, we develop Grad-Mimic, a data selection framework that identifies and prioritizes useful samples, automating the selection process to create effective filters. Empirically, using Mimic scores to guide model training results in consistent performance gains across six image datasets and enhances the performance of CLIP models. Moreover, Mimic scores and their associated filters improve upon existing filtering methods and offer accurate estimation of dataset quality.
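The alignment idea described above can be illustrated with a small sketch. The abstract does not give the exact formula, so the scoring rule here is an assumption: the projection of a sample's negative gradient onto the unit vector from the current weights to the reference weights. The names `mimic_score` and `per_sample_grads`, the linear model, and the squared-error loss are all hypothetical choices for the illustration, not the authors' implementation.

```python
import numpy as np

def mimic_score(grad, theta, theta_ref, eps=1e-12):
    """Assumed score: how far one gradient-descent step on this sample
    moves the weights toward the reference model (positive = aligned)."""
    direction = theta_ref - theta
    direction = direction / (np.linalg.norm(direction) + eps)
    return float(np.dot(-grad, direction))

def per_sample_grads(theta, X, y):
    """Per-sample gradients of the loss 0.5 * (x . theta - y)^2."""
    residual = X @ theta - y
    return residual[:, None] * X

# Toy setup: a linear "reference model" and a new model starting at zero.
theta_ref = np.array([2.0, -1.0])
theta = np.zeros(2)

X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
y = X @ theta_ref   # clean labels produced by the reference model
y[2] = -5.0         # corrupt the third label to simulate a noisy sample

grads = per_sample_grads(theta, X, y)
scores = np.array([mimic_score(g, theta, theta_ref) for g in grads])

# Samples whose update points away from the reference model are filtered out.
keep = scores > 0
```

In this toy case the two clean samples receive positive scores and the corrupted sample receives a negative one, so a zero threshold drops only the noisy sample. The threshold itself is another assumption; a real filter could instead rank the samples and keep the top fraction.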