Organize the Web: Constructing Domains Enhances Pre-Training Data Curation

📅 2025-02-14

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Web-scale pre-training data is unstructured, which makes its contents hard to reason about and its curation hard to systematize. This paper proposes WebOrganizer, a framework that organizes web pages along two complementary domain taxonomies: topic and format. Methodologically, it (i) annotates pre-training data automatically by distilling a large language model's annotations into efficient classifiers; (ii) studies how data from different domains should be mixed to improve downstream performance; (iii) combines the preferred topic and format mixtures for a further gain; and (iv) compares how quality-based selection methods implicitly change the domain mixture. Experiments show that domain mixing improves downstream task performance and also improves existing quality-based selection methods, so domain construction complements quality filtering rather than replacing it.
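The effect of quality filtering on the domain mixture can be illustrated with a small sketch. This is not the paper's code: the document fields (`topic`, `format`, `quality`), the threshold, and the toy corpus are all hypothetical, and the sketch only compares per-domain document shares before and after a score-threshold filter.

```python
from collections import Counter


def domain_distribution(docs, key):
    """Return the share of documents carrying each label under `key`."""
    counts = Counter(doc[key] for doc in docs)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}


def distribution_shift(docs, key, quality_threshold):
    """Change in each domain's share after keeping docs with quality >= threshold."""
    before = domain_distribution(docs, key)
    kept = [doc for doc in docs if doc["quality"] >= quality_threshold]
    after = domain_distribution(kept, key)
    # Domains removed entirely by the filter get a share of 0.0 afterwards.
    return {label: after.get(label, 0.0) - share for label, share in before.items()}


# Hypothetical toy corpus: labels and quality scores are made up for illustration.
docs = [
    {"topic": "science", "format": "article", "quality": 0.9},
    {"topic": "science", "format": "forum", "quality": 0.4},
    {"topic": "shopping", "format": "listing", "quality": 0.2},
    {"topic": "shopping", "format": "listing", "quality": 0.3},
]

shift = distribution_shift(docs, "topic", quality_threshold=0.35)
print(shift)
```

In this toy corpus the filter removes every shopping page, so the topic mixture moves entirely toward science even though the filter never looked at topic labels.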

📝 Abstract

Modern language models are trained on large, unstructured datasets consisting of trillions of tokens and obtained by crawling the web. The unstructured nature makes it difficult to reason about their contents and develop systematic approaches to data curation. In this paper, we unpack monolithic web corpora by developing taxonomies of their contents and organizing them into domains. We introduce WebOrganizer, a framework for organizing web pages in terms of both their topic and format. Using these two complementary notions of domains, we automatically annotate pre-training data by distilling annotations from a large language model into efficient classifiers. This allows us to study how data from different domains should be mixed to improve models on downstream tasks, and we show that we can combine insights about effective topics and formats to further boost performance. We demonstrate that our domain mixing also improves existing methods that select data based on quality. Furthermore, we study and compare how quality-based methods will implicitly change the domain mixture. Overall, our work demonstrates that constructing and mixing domains provides a valuable complement to quality-based data curation methods, opening new avenues for effective and insightful pre-training data curation.
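As a rough illustration of mixing data across topic and format domains, the sketch below resamples a labeled corpus toward a target mixture. It rests on assumptions that do not come from the abstract: the joint target is taken to be the product of separate topic and format targets, the field names and target values are hypothetical, and sampling is with replacement. How the target mixtures are chosen is outside this sketch.

```python
import random
from collections import Counter


def joint_target(topic_target, format_target):
    """Assumed combination rule: treat topic and format targets as independent."""
    return {
        (topic, fmt): p_topic * p_fmt
        for topic, p_topic in topic_target.items()
        for fmt, p_fmt in format_target.items()
    }


def sampling_weights(docs, target):
    """Per-document weight: target mass of its domain split evenly among its docs."""
    counts = Counter((doc["topic"], doc["format"]) for doc in docs)
    weights = []
    for doc in docs:
        domain = (doc["topic"], doc["format"])
        weights.append(target.get(domain, 0.0) / counts[domain])
    return weights


def resample(docs, target, k, seed=0):
    """Draw k documents with replacement, proportionally to the weights."""
    rng = random.Random(seed)
    return rng.choices(docs, weights=sampling_weights(docs, target), k=k)


# Hypothetical corpus and target mixtures, for illustration only.
docs = [
    {"topic": "science", "format": "article"},
    {"topic": "science", "format": "forum"},
    {"topic": "shopping", "format": "article"},
    {"topic": "shopping", "format": "article"},
]
target = joint_target(
    {"science": 0.75, "shopping": 0.25},
    {"article": 0.5, "forum": 0.5},
)
weights = sampling_weights(docs, target)
sample = resample(docs, target, k=100)
```

Domains that are empty in the corpus (here, shopping forum pages) simply receive no samples, and `random.choices` renormalizes the remaining weights.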

Problem

Research questions and friction points this paper is trying to address.

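Web-scale pre-training corpora are unstructured, which makes their contents hard to reason about

Data curation lacks systematic, domain-level approaches

How data from different domains should be mixed to improve downstream performance is unclear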

Innovation

Methods, ideas, or system contributions that make the work stand out.

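Organizing web pages into domains along two taxonomies: topic and format

Annotating pre-training data automatically with efficient classifiers distilled from a large language model

Mixing domains to improve downstream performance, as a complement to quality-based data selection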

🔎 Similar Papers

AutoPureData: Automated Filtering of Undesirable Web Data to Update LLM Knowledge