LSHBloom: Memory-efficient, Extreme-scale Document Deduplication

📅 2024-11-06
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Large-scale technical document deduplication is critical to mitigating redundancy, model memorization, and evaluation bias in LLM training, but existing document-level deduplication methods suffer from prohibitive memory and computational overhead. To address this, we propose LSHBloom: the first method to replace the MinHash-based LSH index with Bloom filters, integrated with signature compression and distributed hashing for scalable, memory-efficient duplicate detection. LSHBloom achieves high precision while enabling extreme memory compression and linear scalability. On the peS2o dataset, it reduces disk footprint to just 0.6% of MinhashLSH's and accelerates processing by 2.7×. At billion-document scale, it saves 54× the memory and speeds up deduplication by 2.5×. LSHBloom establishes a new paradigm for efficient, lightweight, production-deployable deduplication of ultra-large-scale training corpora.

📝 Abstract
Deduplication is a major focus for assembling and curating training datasets for large language models (LLMs) -- detecting and eliminating additional instances of the same content -- in large collections of technical documents. Unrestrained, duplicates in the training dataset increase training costs and lead to undesirable properties such as memorization in trained models or cheating on evaluation. Contemporary approaches to document-level deduplication are often extremely expensive in both runtime and memory. We propose LSHBloom, an extension to MinhashLSH, which replaces the expensive LSHIndex with lightweight Bloom filters. LSHBloom demonstrates the same deduplication performance as MinhashLSH with only a marginal increase in false positives (as low as 1e-5 in our experiments); demonstrates competitive runtime (270% faster than MinhashLSH on peS2o); and, crucially, uses just 0.6% of the disk space required by MinhashLSH to deduplicate peS2o. We demonstrate that this space advantage scales with increased dataset size -- at the extreme scale of several billion documents, LSHBloom promises a 250% speedup and a 54× space advantage over traditional MinhashLSH, scaling deduplication of text datasets to many billions of documents.
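The core idea in the abstract can be sketched in a few lines: MinHash signatures are split into bands, and each band is recorded in a per-band Bloom filter instead of the usual LSH bucket index. The sketch below is illustrative only, not the authors' implementation; all parameter values (signature length, band layout, filter size, hash count) are assumed toy settings.

```python
import hashlib
import random

NUM_PERM = 128           # MinHash signature length (assumed)
BANDS, ROWS = 16, 8      # b bands of r rows; BANDS * ROWS == NUM_PERM
FILTER_BITS = 1 << 20    # bits in each band's Bloom filter (assumed)
NUM_HASHES = 4           # Bloom filter hash functions per item (assumed)

MASK = (1 << 61) - 1     # Mersenne-prime modulus for the hash family
rng = random.Random(42)
PERMS = [(rng.randrange(1, MASK), rng.randrange(MASK)) for _ in range(NUM_PERM)]

def shingles(text, k=3):
    """k-word shingles of a document."""
    words = text.split()
    return {" ".join(words[i:i + k]) for i in range(max(1, len(words) - k + 1))}

def minhash(text):
    """NUM_PERM-element MinHash signature over the document's shingles."""
    hs = [int.from_bytes(hashlib.md5(s.encode()).digest()[:8], "big")
          for s in shingles(text)]
    return [min((a * h + b) % MASK for h in hs) for a, b in PERMS]

class BandBloomIndex:
    """One Bloom filter per LSH band, replacing the hash-table index."""

    def __init__(self):
        self.bits = [bytearray(FILTER_BITS // 8) for _ in range(BANDS)]

    def _positions(self, band, chunk):
        # Derive NUM_HASHES bit positions for this band's signature chunk.
        key = repr(chunk).encode()
        for i in range(NUM_HASHES):
            d = hashlib.sha256(bytes([band, i]) + key).digest()
            yield int.from_bytes(d[:8], "big") % FILTER_BITS

    def check_and_add(self, sig):
        """Return True if any band of `sig` was already present, then insert."""
        dup = False
        for band in range(BANDS):
            chunk = tuple(sig[band * ROWS:(band + 1) * ROWS])
            pos = list(self._positions(band, chunk))
            if all(self.bits[band][p >> 3] & (1 << (p & 7)) for p in pos):
                dup = True   # this band collides with a seen document
            for p in pos:
                self.bits[band][p >> 3] |= 1 << (p & 7)
        return dup

index = BandBloomIndex()
sig = minhash("the quick brown fox jumps over the lazy dog")
print(index.check_and_add(sig))  # False: first occurrence
print(index.check_and_add(sig))  # True: duplicate flagged
```

Because a Bloom filter stores only bits rather than signature buckets, its size is fixed up front and independent of the stored keys, which is where the space advantage over a conventional LSH index comes from, at the cost of a tunable false-positive rate.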
Problem

Research questions and friction points this paper is trying to address.

Memory-efficient deduplication for large language model datasets
Reducing runtime and space costs in extreme-scale document deduplication
Minimizing false positives while maintaining deduplication performance
Innovation

Methods, ideas, or system contributions that make the work stand out.

Replaces the MinhashLSH index with lightweight Bloom filters for memory efficiency
Runs 270% faster than MinhashLSH on peS2o
Uses just 0.6% of the disk space required by MinhashLSH
Arham Khan
Department of Computer Science, University of Chicago; Chicago, IL, United States

Robert Underwood
Assistant Computer Scientist, Argonne National Laboratory
Data for AI for Science, Lossy Compression, Distributed Computing, Reliable Computer Infrastructure

Carlo Siebenschuh
Department of Computer Science, University of Chicago; Chicago, IL, United States

Y. Babuji
Department of Computer Science, University of Chicago; Chicago, IL, United States

Aswathy Ajith
University of Chicago
NLP, Information Extraction

Kyle Hippe
Department of Computer Science, University of Chicago; Chicago, IL, United States

Ozan Gökdemir
Department of Computer Science, University of Chicago; Chicago, IL, United States

Alexander Brace
Department of Computer Science, University of Chicago; Chicago, IL, United States

Kyle Chard
University of Chicago and Argonne National Laboratory
Computer Science, Distributed Systems, High Performance Computing, Scientific Computing

Ian T. Foster
University of Chicago and Argonne National Laboratory
Computer Science, Computational Science, Distributed Computing, Data Science