🤖 AI Summary
Large language models (LLMs) rely on massive web corpora such as Common Crawl for training, yet indiscriminate crawling introduces significant data quality, safety, and ethical risks. Existing research on harmful content has been constrained by computational limits: it relies on small samples and lacks scalable, dataset-level analysis. To address this, the authors propose an efficient indexing and retrieval framework for terabyte-scale, multilingual training corpora. Built on Elasticsearch, the distributed indexing pipeline integrates web parsing and multilingual text processing, enabling full-text search and fine-grained content filtering. Evaluated on the 1.5 TB, four-language FineWeb-2 corpus, it answers most queries in milliseconds and all queries in under 2 seconds. The work demonstrates real-time, scalable auditing of harmful content across an entire training dataset, with matches traceable to the specific documents that contain them, in support of more efficient and transparent AI data governance.
📝 Abstract
Large language models (LLMs) rely heavily on web-scale datasets like Common Crawl, which provides over 80% of training data for some modern models. However, the indiscriminate nature of web crawling raises challenges in data quality, safety, and ethics. Despite the critical importance of training data quality, prior research on harmful content has been limited to small samples due to computational constraints. This project presents a framework for indexing and analyzing LLM training datasets using an Elasticsearch-based pipeline. We apply it to SwissAI's FineWeb-2 corpus (1.5 TB, four languages) and achieve fast query performance: most searches complete in milliseconds, and all complete in under 2 seconds. Our work demonstrates real-time dataset analysis, offering practical tools for safer, more accountable AI systems.
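To make the index-then-query workflow concrete, here is a minimal sketch of the two request bodies such a pipeline might send to Elasticsearch. It is an illustration under assumptions, not the authors' implementation: the index name `fineweb2-demo`, the field names `text` and `language`, and the helper functions are all hypothetical. The code only builds the payloads (the NDJSON format of the `_bulk` endpoint and a `bool` query in the query DSL), so it runs without a cluster.

```python
import json

INDEX_NAME = "fineweb2-demo"  # hypothetical index name, not from the paper


def bulk_body(docs, index=INDEX_NAME):
    """Serialize documents into the NDJSON body expected by the _bulk endpoint.

    Each document becomes two lines: an action line and a source line.
    The body must end with a newline.
    """
    lines = []
    for doc in docs:
        lines.append(json.dumps({"index": {"_index": index, "_id": doc["id"]}}))
        # Field names "text" and "language" are assumptions for illustration.
        lines.append(json.dumps(
            {"text": doc["text"], "language": doc["language"]},
            ensure_ascii=False,
        ))
    return "\n".join(lines) + "\n"


def search_body(phrase, language=None, size=10):
    """Build a query DSL body: phrase match on text, optional language filter."""
    query = {"bool": {"must": [{"match_phrase": {"text": phrase}}]}}
    if language is not None:
        # A term filter assumes "language" is mapped as a keyword field.
        query["bool"]["filter"] = [{"term": {"language": language}}]
    return {
        "size": size,
        "query": query,
        # Ask for highlighted snippets so matches can be located in context.
        "highlight": {"fields": {"text": {}}},
    }


if __name__ == "__main__":
    docs = [
        {"id": "1", "text": "ein Beispiel", "language": "deu"},
        {"id": "2", "text": "un esempio", "language": "ita"},
    ]
    print(bulk_body(docs), end="")
    print(json.dumps(search_body("Beispiel", language="deu"), indent=2))
```

The bulk body would be sent to `POST /_bulk` and the search body to `POST /fineweb2-demo/_search`. Putting the language restriction in `filter` rather than `must` keeps it out of relevance scoring and lets Elasticsearch cache it, which suits repeated audits over the same language subset.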