AI Summary
To address the low efficiency of training-data deduplication for large language models, this paper proposes the first MinHash Locality-Sensitive Hashing (LSH) framework deeply optimized for GPU clusters. It uses lightweight, partially reusable non-cryptographic hash functions (e.g., xxHash) to reduce redundant computation, and combines CUDA kernel optimization, multi-node GPU communication, and distributed hash tables to keep the whole pipeline on the GPU. The design aims to preserve deduplication accuracy while substantially raising throughput. In the reported experiments, the framework processes one million documents up to 58.3× faster than the CPU-based deduplication tool in SlimPajama and up to 8.6× faster than the GPU-accelerated tool in NVIDIA NeMo Curator. Deduplicating a 1.2-trillion-token corpus takes only 5.1 hours on a 16-GPU cluster, a substantial improvement in preprocessing efficiency for very large corpora.
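As a rough reading of the headline figure, the implied aggregate throughput can be worked out directly. This assumes the 5.1 hours is wall-clock time for the full run and that work is spread evenly across the 16 GPUs; neither is stated in the summary, so the per-GPU number is only indicative.

```latex
\[
\frac{1.2\times10^{12}\ \text{tokens}}{5.1\ \text{h}\times 3600\ \text{s/h}}
\approx 6.5\times10^{7}\ \text{tokens/s},
\qquad
\frac{6.5\times10^{7}\ \text{tokens/s}}{16\ \text{GPUs}}
\approx 4.1\times10^{6}\ \text{tokens/s per GPU}.
\]
```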
Abstract
Dataset deduplication plays a crucial role in enhancing data quality, ultimately improving the training performance and efficiency of LLMs. A commonly used method for data deduplication is the MinHash LSH algorithm. Recently, NVIDIA introduced a GPU-based MinHash LSH deduplication method, but it remains suboptimal, leaving room for further improvement in processing efficiency. This paper proposes FED, a GPU-accelerated deduplication framework that optimizes MinHash LSH for GPU clusters and leverages computationally efficient and partially reusable non-cryptographic hash functions. When processing 1 million documents on a single node with four GPUs, FED outperforms the CPU-based deduplication tool included in SlimPajama by up to 58.3 times and the GPU-based deduplication tool included in NVIDIA NeMo Curator by up to 8.6 times. Deduplication of 1.2 trillion tokens is completed in just 5.1 hours in a four-node, 16-GPU environment. The related code is publicly available on GitHub (https://github.com/mcrl/FED).
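For readers unfamiliar with MinHash LSH, the following is a minimal CPU-only sketch of the general technique, not the paper's GPU implementation. Everything in it is an illustrative assumption: the function names, the 5-character shingles, the 128 hash functions split into 32 bands of 4 rows, and the use of the standard library's `blake2b` as a stand-in for a fast non-cryptographic hash such as xxHash. Each shingle is hashed once, and the remaining hash values are derived with the common `(a*h + b) mod p` trick; this is not necessarily the partially reusable hashing scheme the paper describes.

```python
import hashlib
import random
from collections import defaultdict
from itertools import combinations

MERSENNE_PRIME = (1 << 61) - 1  # modulus for the (a*h + b) mod p hash family


def shingles(text, k=5):
    """Return the set of character k-grams of a whitespace-normalized document."""
    text = " ".join(text.lower().split())
    return {text[i:i + k] for i in range(max(1, len(text) - k + 1))}


def base_hash(s):
    """64-bit hash of a shingle (blake2b stands in for xxHash here)."""
    digest = hashlib.blake2b(s.encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def make_permutations(num_perm, seed=0):
    """Draw (a, b) coefficients for num_perm derived hash functions."""
    rng = random.Random(seed)
    return [(rng.randrange(1, MERSENNE_PRIME), rng.randrange(0, MERSENNE_PRIME))
            for _ in range(num_perm)]


def minhash_signature(text, perms, k=5):
    """Hash each shingle once, then take the minimum under every derived hash."""
    hashes = [base_hash(s) for s in shingles(text, k)]
    return [min((a * h + b) % MERSENNE_PRIME for h in hashes) for a, b in perms]


def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching signature positions estimates Jaccard similarity."""
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)


def candidate_pairs(signatures, bands, rows):
    """LSH banding: documents sharing any identical band become candidates."""
    buckets = defaultdict(list)
    for doc_id, sig in signatures.items():
        for b in range(bands):
            key = (b, tuple(sig[b * rows:(b + 1) * rows]))
            buckets[key].append(doc_id)
    pairs = set()
    for ids in buckets.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs


docs = {
    "a": "The quick brown fox jumps over the lazy dog near the river bank.",
    "b": "The quick brown fox jumps over the lazy dog near the river bank!",
    "c": "Completely unrelated text about GPU kernels and distributed hash tables.",
}
perms = make_permutations(128)
sigs = {doc_id: minhash_signature(text, perms) for doc_id, text in docs.items()}
print(sorted(candidate_pairs(sigs, bands=32, rows=4)))
```

With more bands and fewer rows per band, the banding step catches pairs of lower similarity at the cost of more false candidates. A real pipeline would then verify candidates and group the duplicates into clusters, which this sketch leaves out.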