🤖 AI Summary
Large-scale deduplication of technical documents is critical for mitigating redundancy, model memorization, and evaluation contamination in LLM training, but existing document-level deduplication methods suffer from prohibitive memory and computational overhead. To address this, we propose LSHBloom: the first method to replace the MinHash-based LSH index with lightweight Bloom filters for scalable, memory-efficient duplicate detection. LSHBloom matches the deduplication performance of MinHashLSH while enabling extreme space compression and near-linear scalability. On the peS2o dataset, it reduces the disk footprint to just 0.6% of MinHashLSH's and accelerates processing by 2.7×. At the scale of several billion documents, it promises a 54× space advantage and a 2.5× speedup, establishing an efficient, lightweight, production-deployable approach to deduplicating ultra-large-scale training corpora.
📝 Abstract
Deduplication -- detecting and eliminating additional instances of the same content -- is a major focus when assembling and curating training datasets for large language models (LLMs) from large collections of technical documents. Left unchecked, duplicates in the training dataset increase training costs and lead to undesirable properties such as memorization in trained models or cheating on evaluations. Contemporary approaches to document-level deduplication are often extremely expensive in both runtime and memory. We propose LSHBloom, an extension to MinHashLSH that replaces the expensive LSH index with lightweight Bloom filters. LSHBloom matches the deduplication performance of MinHashLSH with only a marginal increase in false positives (as low as 1e-5 in our experiments); demonstrates competitive runtime (270% faster than MinHashLSH on peS2o); and, crucially, uses just 0.6% of the disk space that MinHashLSH requires to deduplicate peS2o. This space advantage grows with dataset size: at the extreme scale of several billion documents, LSHBloom promises a 250% speedup and a 54× space advantage over traditional MinHashLSH, scaling deduplication of text datasets to many billions of documents.
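The abstract describes swapping MinHashLSH's index for one Bloom filter per LSH band: a document is flagged as a duplicate when any of its band signatures is already present in the corresponding filter. The sketch below illustrates that idea; it is a minimal illustration under our own assumptions -- the hash choices, shingling scheme, band/row parameters, and the `LSHBloomIndex` name are illustrative, not the paper's implementation.

```python
import hashlib

def _hash64(data: bytes, seed: int) -> int:
    """64-bit seeded hash (blake2b accepts salts of up to 16 bytes)."""
    h = hashlib.blake2b(data, digest_size=8, salt=seed.to_bytes(16, "little"))
    return int.from_bytes(h.digest(), "little")

class BloomFilter:
    """Minimal bit-array Bloom filter: set membership with no false
    negatives and a tunable false-positive rate."""
    def __init__(self, num_bits: int = 1 << 20, num_hashes: int = 7):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8)

    def _positions(self, item: bytes):
        for seed in range(self.num_hashes):
            yield _hash64(item, seed) % self.num_bits

    def add(self, item: bytes) -> None:
        for pos in self._positions(item):
            self.bits[pos >> 3] |= 1 << (pos & 7)

    def __contains__(self, item: bytes) -> bool:
        return all(self.bits[pos >> 3] & (1 << (pos & 7))
                   for pos in self._positions(item))

def shingles(text: str, k: int = 5):
    """Character k-shingles of a whitespace-normalized document."""
    text = " ".join(text.lower().split())
    return {text[i:i + k] for i in range(max(1, len(text) - k + 1))}

def minhash_signature(shingle_set, num_perm: int = 128):
    """MinHash signature: for each 'permutation' (here, a seeded hash),
    keep the minimum hash over the document's shingles."""
    # Seeds offset by 1000 so signature hashes differ from Bloom hashes.
    return [min(_hash64(s.encode(), 1000 + p) for s in shingle_set)
            for p in range(num_perm)]

class LSHBloomIndex:
    """One Bloom filter per LSH band; a document is a duplicate candidate
    if any band of its signature was seen before."""
    def __init__(self, num_perm: int = 128, num_bands: int = 16):
        assert num_perm % num_bands == 0
        self.rows = num_perm // num_bands
        self.filters = [BloomFilter() for _ in range(num_bands)]

    def query_and_insert(self, signature) -> bool:
        bands = [tuple(signature[i:i + self.rows])
                 for i in range(0, len(signature), self.rows)]
        keys = [repr(b).encode() for b in bands]
        is_dup = any(key in self.filters[i] for i, key in enumerate(keys))
        for i, key in enumerate(keys):
            self.filters[i].add(key)
        return is_dup
```

Because a Bloom filter stores only bits rather than band keys, the index's footprint is fixed by the filter size instead of growing with every stored signature, which is the source of the space savings; the filter's false positives are what introduce the small extra false-positive rate quoted above.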