Organize the Web: Constructing Domains Enhances Pre-Training Data Curation

📅 2025-02-14
🤖 AI Summary
Web-scale pretraining data is unstructured, which makes its contents hard to reason about and its curation hard to systematize. This paper proposes WebOrganizer, a framework that organizes web pages along two complementary domain dimensions: topic and format. The method (i) annotates pages with a large language model and distills those annotations into efficient classifiers, so that pre-training corpora can be labeled automatically; (ii) studies how data from different domains should be mixed to improve downstream performance, combining insights about effective topics and formats; and (iii) compares how quality-based selection methods implicitly shift the domain mixture. Experiments show that domain mixing improves downstream task performance and also improves existing quality-based selection methods, which makes domain construction a complement to quality filtering rather than a replacement for it.
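
The effect of quality filtering on the domain mixture can be illustrated with a small sketch. This is not the paper's code: the document fields (`topic`, `quality`), the topic names, and the 0.5 threshold are hypothetical, chosen only to show how one might compare the per-domain share of a corpus before and after a filter.

```python
from collections import Counter


def domain_distribution(docs, key):
    """Return the fraction of documents that fall in each domain under `key`."""
    counts = Counter(doc[key] for doc in docs)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {domain: n / total for domain, n in counts.items()}


def distribution_shift(before, after):
    """Per-domain change in share (after minus before)."""
    domains = set(before) | set(after)
    return {d: after.get(d, 0.0) - before.get(d, 0.0) for d in domains}


# Hypothetical toy corpus: each page has a topic label and a quality score.
docs = [
    {"topic": "science", "quality": 0.9},
    {"topic": "science", "quality": 0.8},
    {"topic": "sports", "quality": 0.3},
    {"topic": "sports", "quality": 0.7},
]

# Hypothetical quality filter: keep pages scoring at least 0.5.
kept = [d for d in docs if d["quality"] >= 0.5]

before = domain_distribution(docs, "topic")
after = domain_distribution(kept, "topic")
shift = distribution_shift(before, after)
```

Here the filter never looks at the topic label, yet the share of "science" pages rises and the share of "sports" pages falls, which is the kind of implicit shift the summary refers to.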

📝 Abstract
Modern language models are trained on large, unstructured datasets consisting of trillions of tokens and obtained by crawling the web. The unstructured nature makes it difficult to reason about their contents and develop systematic approaches to data curation. In this paper, we unpack monolithic web corpora by developing taxonomies of their contents and organizing them into domains. We introduce WebOrganizer, a framework for organizing web pages in terms of both their topic and format. Using these two complementary notions of domains, we automatically annotate pre-training data by distilling annotations from a large language model into efficient classifiers. This allows us to study how data from different domains should be mixed to improve models on downstream tasks, and we show that we can combine insights about effective topics and formats to further boost performance. We demonstrate that our domain mixing also improves existing methods that select data based on quality. Furthermore, we study and compare how quality-based methods will implicitly change the domain mixture. Overall, our work demonstrates that constructing and mixing domains provides a valuable complement to quality-based data curation methods, opening new avenues for effective and insightful pre-training data curation.
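
As a rough illustration of mixing data by topic and format, the sketch below resamples a labeled corpus toward a target mixture. Several things here are assumptions rather than the paper's method: the target joint weight is taken to be the product of a topic weight and a format weight, the domain names and weights are invented, and the helper names (`joint_target`, `sampling_weights`, `resample`) are hypothetical.

```python
import random
from collections import Counter


def joint_target(topic_weights, format_weights):
    """Assumed combination rule: joint weight = topic weight * format weight."""
    return {
        (t, f): tw * fw
        for t, tw in topic_weights.items()
        for f, fw in format_weights.items()
    }


def sampling_weights(docs, target):
    """Spread each domain's target mass evenly over the documents in that domain."""
    counts = Counter((d["topic"], d["format"]) for d in docs)
    weights = []
    for d in docs:
        domain = (d["topic"], d["format"])
        weights.append(target.get(domain, 0.0) / counts[domain])
    return weights


def resample(docs, target, k, seed=0):
    """Draw k documents with replacement so domains follow the target mixture."""
    rng = random.Random(seed)
    return rng.choices(docs, weights=sampling_weights(docs, target), k=k)


# Hypothetical annotated corpus (labels would come from the distilled classifiers).
docs = [
    {"id": 0, "topic": "science", "format": "tutorial"},
    {"id": 1, "topic": "science", "format": "forum"},
    {"id": 2, "topic": "sports", "format": "tutorial"},
    {"id": 3, "topic": "sports", "format": "forum"},
    {"id": 4, "topic": "sports", "format": "forum"},
]

# Hypothetical target mixtures over topics and formats.
topic_weights = {"science": 0.75, "sports": 0.25}
format_weights = {"tutorial": 0.5, "forum": 0.5}

target = joint_target(topic_weights, format_weights)
weights = sampling_weights(docs, target)
sample = resample(docs, target, k=1000)
```

Dividing each domain's target mass by its document count means over-represented domains are downweighted per document, so the resampled set follows the target mixture rather than the raw crawl distribution.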
Problem

Research questions and friction points this paper is trying to address.

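Web-crawled pre-training corpora are unstructured, so their contents are hard to reason about
Systematic approaches to data curation are difficult to develop for monolithic corpora
It is unclear how data from different domains should be mixed to improve downstream performance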
Innovation

Methods, ideas, or system contributions that make the work stand out.

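Organizing web pages into domains along two complementary taxonomies: topic and format
Annotating pre-training data automatically by distilling LLM annotations into efficient classifiers
Mixing topic and format domains to improve downstream performance, as a complement to quality-based selection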