Crawl4LLM: Efficient Web Crawling for LLM Pretraining

📅 2025-02-19
📈 Citations: 0
Influential: 0
🤖 AI Summary
LLM pretraining pipelines discard most crawled web pages for low quality, wasting bandwidth and burdening web servers, largely because conventional crawlers schedule URLs by graph connectivity rather than by what actually helps model training. This paper proposes Crawl4LLM, a pretraining-utility-driven crawling framework that prioritizes URLs by a webpage's estimated influence on LLM pretraining, using that influence score in place of the standard connectivity-based priority in the crawler's scheduler. Evaluated on a web graph of 900 million pages from a commercial search engine's index, crawling only 21% of the URLs suffices to match the downstream task performance of previous full-scale crawls, substantially reducing crawling waste and server-side load. This establishes a utility-oriented crawling paradigm explicitly optimized for LLM pretraining efficacy.

📝 Abstract
Web crawl is a main source of large language models' (LLMs) pretraining data, but the majority of crawled web pages are discarded in pretraining due to low data quality. This paper presents Crawl4LLM, an efficient web crawling method that explores the web graph based on the preference of LLM pretraining. Specifically, it leverages the influence of a webpage in LLM pretraining as the priority score of the web crawler's scheduler, replacing the standard graph connectivity based priority. Our experiments on a web graph containing 900 million webpages from a commercial search engine's index demonstrate the efficiency of Crawl4LLM in obtaining high-quality pretraining data. With just 21% URLs crawled, LLMs pretrained on Crawl4LLM data reach the same downstream performances of previous crawls, significantly reducing the crawling waste and alleviating the burdens on websites. Our code is publicly available at https://github.com/cxcscmu/Crawl4LLM.
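The abstract describes the core mechanism: a best-first crawler whose frontier is ordered by a page's pretraining-influence score rather than by graph connectivity. A minimal sketch of that scheduling loop is below; `score_fn` and `fetch_fn` are hypothetical stand-ins (the paper's actual influence estimator and crawling infrastructure are not specified here), with the frontier kept as a max-heap over scores.

```python
import heapq

def crawl(seed_urls, score_fn, fetch_fn, max_fetches):
    """Best-first crawl: the frontier is a max-heap keyed by a
    pretraining-influence score (Crawl4LLM-style) instead of the
    usual graph-connectivity priority.

    score_fn(url) -> float   # hypothetical influence estimator
    fetch_fn(url) -> list    # hypothetical fetcher returning outlinks
    """
    # heapq is a min-heap, so negate scores to pop the highest first.
    frontier = [(-score_fn(u), u) for u in seed_urls]
    heapq.heapify(frontier)
    visited = {u for _, u in frontier}
    crawled = []
    while frontier and len(crawled) < max_fetches:
        _, url = heapq.heappop(frontier)
        crawled.append(url)
        for link in fetch_fn(url):
            if link not in visited:          # schedule each URL once
                visited.add(link)
                heapq.heappush(frontier, (-score_fn(link), link))
    return crawled
```

On a toy graph this visits high-influence pages first: with scores `{"a": 1.0, "b": 0.2, "c": 0.9, "d": 0.5}` and links `a → {b, c}`, `c → {d}`, a budget of 3 fetches yields `["a", "c", "d"]`, skipping the low-scoring `b` entirely.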
Problem

Research questions and friction points this paper is trying to address.

Improves web crawling efficiency for LLM pretraining
Reduces data waste by prioritizing high-quality webpages
Minimizes website burden while maintaining LLM performance
Innovation

Methods, ideas, or system contributions that make the work stand out.

Prioritizes webpages by LLM influence
Reduces crawling waste significantly
Improves LLM pretraining data quality