GPT-4o as the Gold Standard: A Scalable and General Purpose Approach to Filter Language Model Pretraining Data

📅 2024-10-03

📈 Citations: 1

✨ Influential: 0

🤖 AI Summary

Curating high-quality web data for large language model training is costly and slow. This paper proposes SIEVE, a framework that combines active learning with knowledge distillation: GPT-4o labels a small number of examples, and those labels are used to train lightweight text classifiers. SIEVE filters data at less than 1% of the cost of GPT-4o while matching its accuracy, and different filtering prompts adapt it to general or specialized domains. On five specialized filtering tasks it matches GPT-4o's performance; on the DataComp-LM benchmark it improves over prior state-of-the-art quality filtering methods for selecting pretraining data. Its core contribution is pairing the capability of a large model with the efficiency of a lightweight one, giving an efficient, general-purpose, and customizable approach to pretraining data curation.

📝 Abstract

Large language models require vast amounts of high-quality training data, but effective filtering of web-scale datasets remains a significant challenge. This paper demonstrates that GPT-4o is remarkably effective at identifying high-quality training data, but its prohibitive cost makes it impractical at web scale. We propose SIEVE, a lightweight alternative that matches GPT-4o accuracy at less than 1% of the cost. SIEVE can perform up to 500 filtering operations for the cost of one GPT-4o filtering call. The key to SIEVE is a seamless integration of GPT-4o and lightweight text classification models, using active learning to fine-tune these models in the background with a small number of calls to GPT-4o. Once trained, it performs as well as GPT-4o at a tiny fraction of the cost. Through different filtering prompts, SIEVE can efficiently curate high-quality data for general or specialized domains from web-scale corpora, a valuable capability given the current scarcity of high-quality domain-specific datasets. Extensive experiments using automatic and human evaluation metrics show that SIEVE and GPT-4o achieve similar performance on five highly specific filtering prompts. In addition, when performing quality filtering on web crawl datasets, we demonstrate that SIEVE can further improve over state-of-the-art quality filtering methods in the DataComp-LM challenge for selecting LLM pretraining data.
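
The abstract describes an expensive labeler guiding a cheap classifier through active learning. The sketch below illustrates that general pattern only; it is not the authors' implementation. Several pieces are assumptions made for illustration: `MockTeacher` is a hypothetical keyword rule standing in for a GPT-4o filtering call (no API is contacted), the student is a tiny bag-of-words logistic regression rather than the lightweight models used in the paper, and uncertainty sampling is swapped in as a common query strategy because the abstract does not name one.

```python
import math
import random

# Toy vocabulary; purely illustrative, not taken from the paper.
VOCAB = ["theorem", "proof", "lemma", "recipe", "football", "weather", "movie", "travel"]


class MockTeacher:
    """Hypothetical stand-in for a GPT-4o filtering call (no API is contacted)."""

    def __init__(self):
        self.calls = 0  # counts how many "expensive" labels were requested

    def label(self, text):
        self.calls += 1
        words = text.split()
        return int("theorem" in words or "proof" in words)


def featurize(text):
    """Binary bag-of-words vector over VOCAB."""
    words = set(text.split())
    return [1.0 if w in words else 0.0 for w in VOCAB]


class LogisticStudent:
    """Tiny logistic-regression classifier trained with plain SGD."""

    def __init__(self, dim):
        self.w = [0.0] * dim
        self.b = 0.0

    def predict_proba(self, x):
        z = self.b + sum(wi * xi for wi, xi in zip(self.w, x))
        z = max(-30.0, min(30.0, z))  # clamp so exp() stays well-behaved
        return 1.0 / (1.0 + math.exp(-z))

    def fit(self, X, y, epochs=200, lr=0.5):
        # Continues from the current weights (warm start) on each call.
        for _ in range(epochs):
            for xi, yi in zip(X, y):
                grad = self.predict_proba(xi) - yi  # gradient of log-loss w.r.t. z
                self.b -= lr * grad
                self.w = [w - lr * grad * v for w, v in zip(self.w, xi)]


def make_pool(n, seed=0):
    """Synthetic 'web' documents: three distinct random words each."""
    rng = random.Random(seed)
    return [" ".join(rng.sample(VOCAB, 3)) for _ in range(n)]


def distill(pool, teacher, budget, batch=4, seed=0):
    """Pool-based active learning with uncertainty sampling (an assumed strategy)."""
    rng = random.Random(seed)
    unlabeled = list(pool)
    rng.shuffle(unlabeled)
    student = LogisticStudent(len(VOCAB))
    labeled = []
    while len(labeled) < budget and unlabeled:
        # Most uncertain first: predicted probability closest to 0.5.
        unlabeled.sort(key=lambda t: abs(student.predict_proba(featurize(t)) - 0.5))
        k = min(batch, budget - len(labeled))
        queries, unlabeled = unlabeled[:k], unlabeled[k:]
        # Only these queries reach the expensive teacher.
        labeled += [(t, teacher.label(t)) for t in queries]
        student.fit([featurize(t) for t, _ in labeled], [y for _, y in labeled])
    return student, len(labeled)


def filter_corpus(corpus, student, threshold=0.5):
    """Filter with the student alone; the teacher is never called here."""
    return [t for t in corpus if student.predict_proba(featurize(t)) >= threshold]
```

Calling `distill(make_pool(200), MockTeacher(), budget=16)` queries the mock teacher exactly 16 times, and `filter_corpus` can then score the whole pool without any further teacher calls. That separation between a small labeling budget and unlimited cheap inference is the property the abstract's cost comparison relies on.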

Problem

Research questions and friction points this paper is trying to address.

Large Language Models

Data Selection

Internet Data Quality

Innovation

Methods, ideas, or system contributions that make the work stand out.

SIEVE

active learning

high-quality training data

🔎 Similar Papers

A Comprehensive Survey of Contamination Detection Methods in Large Language Models