Crawl4LLM: Efficient Web Crawling for LLM Pretraining

📅 2025-02-19
📈 Citations: 0
Influential: 0
🤖 AI Summary
LLM pretraining pipelines discard most crawled web pages for low quality, wasting bandwidth and burdening web servers, largely because conventional crawlers schedule URLs by graph connectivity rather than by what actually helps model training. This paper proposes Crawl4LLM, a pretraining-utility-driven crawling framework that prioritizes URLs by a webpage's estimated influence on LLM pretraining, using that influence score in place of the standard connectivity-based priority in the crawler's scheduler. Evaluated on a web graph of 900 million pages from a commercial search engine's index, crawling only 21% of the URLs suffices to match the downstream task performance of previous full-scale crawls, substantially reducing crawling waste and server-side load. This establishes a utility-oriented crawling paradigm explicitly optimized for LLM pretraining efficacy.

📝 Abstract
Web crawl is a main source of large language models' (LLMs) pretraining data, but the majority of crawled web pages are discarded in pretraining due to low data quality. This paper presents Crawl4LLM, an efficient web crawling method that explores the web graph based on the preference of LLM pretraining. Specifically, it leverages the influence of a webpage in LLM pretraining as the priority score of the web crawler's scheduler, replacing the standard graph connectivity based priority. Our experiments on a web graph containing 900 million webpages from a commercial search engine's index demonstrate the efficiency of Crawl4LLM in obtaining high-quality pretraining data. With just 21% URLs crawled, LLMs pretrained on Crawl4LLM data reach the same downstream performances of previous crawls, significantly reducing the crawling waste and alleviating the burdens on websites. Our code is publicly available at https://github.com/cxcscmu/Crawl4LLM.
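The abstract describes the core mechanism: a best-first crawler whose frontier is ordered by a page's pretraining-influence score rather than by graph connectivity. A minimal sketch of that scheduling loop is below; `score_fn` and `fetch_fn` are hypothetical stand-ins (the paper's actual influence estimator and crawling infrastructure are not specified here), with the frontier kept as a max-heap over scores.

```python
import heapq

def crawl(seed_urls, score_fn, fetch_fn, max_fetches):
    """Best-first crawl: the frontier is a max-heap keyed by a
    pretraining-influence score (Crawl4LLM-style) instead of the
    usual graph-connectivity priority.

    score_fn(url) -> float   # hypothetical influence estimator
    fetch_fn(url) -> list    # hypothetical fetcher returning outlinks
    """
    # heapq is a min-heap, so negate scores to pop the highest first.
    frontier = [(-score_fn(u), u) for u in seed_urls]
    heapq.heapify(frontier)
    visited = {u for _, u in frontier}
    crawled = []
    while frontier and len(crawled) < max_fetches:
        _, url = heapq.heappop(frontier)
        crawled.append(url)
        for link in fetch_fn(url):
            if link not in visited:          # schedule each URL once
                visited.add(link)
                heapq.heappush(frontier, (-score_fn(link), link))
    return crawled
```

On a toy graph this visits high-influence pages first: with scores `{"a": 1.0, "b": 0.2, "c": 0.9, "d": 0.5}` and links `a → {b, c}`, `c → {d}`, a budget of 3 fetches yields `["a", "c", "d"]`, skipping the low-scoring `b` entirely.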
Problem

Research questions and friction points this paper is trying to address.

Improves web crawling efficiency for LLM pretraining
Reduces data waste by prioritizing high-quality webpages
Minimizes website burden while maintaining LLM performance
Innovation

Methods, ideas, or system contributions that make the work stand out.

Prioritizes webpages by LLM influence
Reduces crawling waste significantly
Improves LLM pretraining data quality