🤖 AI Summary
Data selection for large language model (LLM) pretraining relies heavily on embedding-based similarity metrics, yet existing embedding models lack systematic evaluation grounded in pretraining objectives. Method: We propose a pretraining-aware evaluation framework for data embedding models, establishing a quantitative link between embedding-space similarity and pretraining loss. Our empirical study uses a 1.7B-parameter decoder-only model trained on the Pile. Contribution/Results: We find that simply averaging per-token embeddings performs comparably to sophisticated state-of-the-art embedding models, revealing a structural misalignment between current embedding designs and pretraining objectives. The framework both delineates the efficacy boundaries of diverse embedding methods and provides a reproducible evaluation paradigm, along with principled design guidance for developing "pretraining-aware" embedding models.
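The surprisingly competitive baseline mentioned above can be sketched concretely. This is a minimal illustration, not the paper's implementation: it assumes `embedding_matrix` is any `[vocab_size, dim]` per-token lookup table (for example, a pretrained model's input embeddings) and mean-pools token vectors into one document vector.

```python
import numpy as np

def average_token_embeddings(token_ids, embedding_matrix):
    """Mean-pool per-token embeddings into a single document vector.

    Minimal sketch of the averaging baseline; `embedding_matrix` is a
    hypothetical [vocab_size, dim] lookup table, not tied to any
    specific model from the paper.
    """
    vectors = embedding_matrix[np.asarray(token_ids)]  # [seq_len, dim]
    return vectors.mean(axis=0)                        # [dim]

def cosine_similarity(a, b):
    """Cosine similarity between two document embeddings."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
```

Document vectors produced this way can then be compared with `cosine_similarity` for curation decisions such as deduplication or nearest-neighbor selection.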
📝 Abstract
Many methods curate pretraining datasets for language models using similarity between training examples -- both to diversify the dataset and to select examples similar to high-quality data. However, similarity is typically measured with off-the-shelf embedding models that are generic or trained for tasks such as retrieval. This paper introduces a framework to analyze the suitability of embedding models specifically for data curation in the language model pretraining setting. We quantify the correlation between similarity in the embedding space and similarity in pretraining loss across training examples, and how diversifying in the embedding space affects pretraining quality. We analyze a variety of embedding models in our framework, with experiments using the Pile dataset for pretraining a 1.7B parameter decoder-only language model. We find that the embedding models we consider are all useful for pretraining data curation. Moreover, a simple approach of averaging per-token embeddings proves to be surprisingly competitive with more sophisticated embedding models -- likely because the latter are not designed specifically for pretraining data curation. Indeed, we believe our analysis and evaluation framework can serve as a foundation for the design of embedding models that specifically reason about similarity in pretraining datasets.
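The framework's central quantity, the correlation between embedding-space similarity and similarity in pretraining loss, can be illustrated with a short sketch. This is a hypothetical instantiation under stated assumptions, not the paper's exact metric: it takes one loss value per example, treats the negative absolute loss difference of a pair as its "loss similarity", and reports the Pearson correlation against pairwise cosine similarity.

```python
import numpy as np

def embedding_loss_correlation(embeddings, losses):
    """Pearson correlation between pairwise embedding similarity and
    pairwise closeness in pretraining loss.

    `embeddings` is an [n, dim] array of document vectors; `losses` is a
    length-n array of per-example pretraining losses. Both the pairing
    scheme and the loss-closeness definition here are illustrative
    assumptions, not the paper's protocol.
    """
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    n = len(losses)
    sims, loss_closeness = [], []
    for i in range(n):
        for j in range(i + 1, n):
            sims.append(float(normed[i] @ normed[j]))
            loss_closeness.append(-abs(losses[i] - losses[j]))
    # A higher value suggests the embedding space reflects pretraining loss.
    return float(np.corrcoef(sims, loss_closeness)[0, 1])
```

An embedding model scoring higher under such a measure would, in the spirit of the abstract, be better aligned with the pretraining objective than one trained purely for retrieval.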