🤖 AI Summary
Large-scale deduplication of technical documents is critical for mitigating redundancy, model memorization, and evaluation contamination in LLM training, but existing document-level deduplication methods suffer from prohibitive memory and computational overhead. To address this, we propose LSHBloom: the first method to replace the MinHash-based LSH index with lightweight Bloom filters for scalable, memory-efficient duplicate detection. LSHBloom matches the deduplication performance of MinHashLSH while enabling extreme space compression and near-linear scalability. On the peS2o dataset, it reduces the disk footprint to just 0.6% of MinHashLSH's and accelerates processing by 2.7×. At the scale of several billion documents, it promises a 54× space advantage and a 2.5× speedup, establishing an efficient, lightweight, production-deployable approach to deduplicating ultra-large-scale training corpora.
📝 Abstract
Deduplication -- detecting and eliminating additional instances of the same content -- is a major focus when assembling and curating training datasets for large language models (LLMs) from large collections of technical documents. Left unchecked, duplicates in the training dataset increase training costs and lead to undesirable properties such as memorization in trained models or cheating on evaluations. Contemporary approaches to document-level deduplication are often extremely expensive in both runtime and memory. We propose LSHBloom, an extension to MinHashLSH that replaces the expensive LSH index with lightweight Bloom filters. LSHBloom matches the deduplication performance of MinHashLSH with only a marginal increase in false positives (as low as 1e-5 in our experiments); demonstrates competitive runtime (270% faster than MinHashLSH on peS2o); and, crucially, uses just 0.6% of the disk space that MinHashLSH requires to deduplicate peS2o. This space advantage grows with dataset size: at the extreme scale of several billion documents, LSHBloom promises a 250% speedup and a 54× space advantage over traditional MinHashLSH, scaling deduplication of text datasets to many billions of documents.
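The abstract describes swapping MinHashLSH's index for one Bloom filter per LSH band: a document is flagged as a duplicate when any of its band signatures is already present in the corresponding filter. The sketch below illustrates that idea; it is a minimal illustration under our own assumptions -- the hash choices, shingling scheme, band/row parameters, and the `LSHBloomIndex` name are illustrative, not the paper's implementation.

```python
import hashlib

def _hash64(data: bytes, seed: int) -> int:
    """64-bit seeded hash (blake2b accepts salts of up to 16 bytes)."""
    h = hashlib.blake2b(data, digest_size=8, salt=seed.to_bytes(16, "little"))
    return int.from_bytes(h.digest(), "little")

class BloomFilter:
    """Minimal bit-array Bloom filter: set membership with no false
    negatives and a tunable false-positive rate."""
    def __init__(self, num_bits: int = 1 << 20, num_hashes: int = 7):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8)

    def _positions(self, item: bytes):
        for seed in range(self.num_hashes):
            yield _hash64(item, seed) % self.num_bits

    def add(self, item: bytes) -> None:
        for pos in self._positions(item):
            self.bits[pos >> 3] |= 1 << (pos & 7)

    def __contains__(self, item: bytes) -> bool:
        return all(self.bits[pos >> 3] & (1 << (pos & 7))
                   for pos in self._positions(item))

def shingles(text: str, k: int = 5):
    """Character k-shingles of a whitespace-normalized document."""
    text = " ".join(text.lower().split())
    return {text[i:i + k] for i in range(max(1, len(text) - k + 1))}

def minhash_signature(shingle_set, num_perm: int = 128):
    """MinHash signature: for each 'permutation' (here, a seeded hash),
    keep the minimum hash over the document's shingles."""
    # Seeds offset by 1000 so signature hashes differ from Bloom hashes.
    return [min(_hash64(s.encode(), 1000 + p) for s in shingle_set)
            for p in range(num_perm)]

class LSHBloomIndex:
    """One Bloom filter per LSH band; a document is a duplicate candidate
    if any band of its signature was seen before."""
    def __init__(self, num_perm: int = 128, num_bands: int = 16):
        assert num_perm % num_bands == 0
        self.rows = num_perm // num_bands
        self.filters = [BloomFilter() for _ in range(num_bands)]

    def query_and_insert(self, signature) -> bool:
        bands = [tuple(signature[i:i + self.rows])
                 for i in range(0, len(signature), self.rows)]
        keys = [repr(b).encode() for b in bands]
        is_dup = any(key in self.filters[i] for i, key in enumerate(keys))
        for i, key in enumerate(keys):
            self.filters[i].add(key)
        return is_dup
```

Because a Bloom filter stores only bits rather than band keys, the index's footprint is fixed by the filter size instead of growing with every stored signature, which is the source of the space savings; the filter's false positives are what introduce the small extra false-positive rate quoted above.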