FED: Fast and Efficient Dataset Deduplication Framework with GPU Acceleration

📅 2025-01-02
📈 Citations: 0
✨ Influential: 0

🤖 AI Summary
To address the low efficiency of training-data deduplication for large language models, this paper proposes FED, a MinHash Locality-Sensitive Hashing (LSH) framework optimized for GPU clusters. It builds on an earlier GPU-based MinHash LSH tool from NVIDIA, which the authors consider suboptimal. Methodologically, FED uses lightweight, partially reusable non-cryptographic hash functions (e.g., xxHash) to reduce redundant computation, and combines CUDA kernel optimization, multi-node GPU communication, and distributed hash tables for end-to-end GPU acceleration. On one million documents with a single four-GPU node, FED is up to 58.3× faster than the CPU-based deduplication tool in SlimPajama and up to 8.6× faster than the GPU-based tool in NVIDIA NeMo Curator. Deduplicating a 1.2-trillion-token corpus takes 5.1 hours on a four-node, 16-GPU cluster.
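
For readers unfamiliar with MinHash LSH, the following is a minimal CPU-only sketch of the generic algorithm, not FED's implementation. The signature length, band count, and shingle size are illustrative assumptions, and BLAKE2b from the Python standard library stands in for a fast non-cryptographic hash such as xxHash. The partially reusable hashing and the GPU kernels described in the summary are not reproduced here.

```python
import hashlib
from collections import defaultdict
from itertools import combinations

NUM_HASHES = 128  # signature length (assumed, not taken from the paper)
BANDS = 16        # number of LSH bands (assumed)
SHINGLE = 5       # characters per shingle (assumed)


def shingles(text, k=SHINGLE):
    """Return the set of overlapping k-character substrings of text."""
    return {text[i:i + k] for i in range(max(1, len(text) - k + 1))}


def hash64(data, seed):
    """Seeded 64-bit hash. BLAKE2b is a stdlib stand-in for xxHash."""
    salt = seed.to_bytes(8, "little")
    digest = hashlib.blake2b(data, digest_size=8, salt=salt).digest()
    return int.from_bytes(digest, "little")


def minhash(text):
    """MinHash signature: the minimum hash over all shingles, per seed."""
    grams = [s.encode("utf-8") for s in shingles(text)]
    return [min(hash64(g, seed) for g in grams) for seed in range(NUM_HASHES)]


def candidate_pairs(docs):
    """Return pairs of document ids whose signatures agree on at least one band."""
    rows = NUM_HASHES // BANDS
    buckets = defaultdict(list)
    for doc_id, text in docs.items():
        sig = minhash(text)
        for b in range(BANDS):
            key = (b, tuple(sig[b * rows:(b + 1) * rows]))
            buckets[key].append(doc_id)
    pairs = set()
    for ids in buckets.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs


docs = {
    "a": "the quick brown fox jumps over the lazy dog",
    "b": "the quick brown fox jumps over the lazy dog",
    "c": "an entirely unrelated sentence about GPUs",
}
# Identical documents agree on every band, so ("a", "b") is always a candidate.
print(candidate_pairs(docs))
```

Candidate pairs are usually verified afterwards (for example with an exact Jaccard similarity check) before one document of each pair is dropped. More bands with fewer rows each raise recall at the cost of more false candidates.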

๐Ÿ“ Abstract
Dataset deduplication plays a crucial role in enhancing data quality, ultimately improving the training performance and efficiency of LLMs. A commonly used method for data deduplication is the MinHash LSH algorithm. Recently, NVIDIA introduced a GPU-based MinHash LSH deduplication method, but it remains suboptimal, leaving room for further improvement in processing efficiency. This paper proposes a GPU-accelerated deduplication framework, FED, that optimizes MinHash LSH for GPU clusters and leverages computationally efficient and partially reusable non-cryptographic hash functions. FED outperforms the CPU-based deduplication tool included in SlimPajama by up to 58.3 times and the GPU-based deduplication tool included in NVIDIA NeMo Curator by up to 8.6 times when processing 1 million documents on a node with four GPUs. Deduplication of 1.2 trillion tokens completes in just 5.1 hours in a four-node, 16-GPU environment. The related code is publicly available on GitHub (https://github.com/mcrl/FED).
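
To put the headline figure in perspective, the short calculation below converts "1.2 trillion tokens in 5.1 hours on 16 GPUs" into an average throughput. It is back-of-the-envelope arithmetic on the numbers quoted in the abstract: it assumes the 5.1 hours is end-to-end wall-clock time and that the work is spread evenly across GPUs, neither of which the abstract states. The function name is illustrative.

```python
TOKENS = 1.2e12  # corpus size from the abstract
HOURS = 5.1      # reported deduplication time
GPUS = 16        # four nodes with four GPUs each


def average_throughput(tokens, hours, gpus):
    """Return (tokens/s overall, tokens/s per GPU), assuming an even split."""
    seconds = hours * 3600
    overall = tokens / seconds
    return overall, overall / gpus


overall, per_gpu = average_throughput(TOKENS, HOURS, GPUS)
print(f"{overall:.3e} tokens/s overall, {per_gpu:.3e} tokens/s per GPU")
```

Under these assumptions the average works out to roughly 65 million tokens per second overall, or about 4 million tokens per second per GPU.
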
Problem

Research questions and friction points this paper is trying to address.

Big Data
Duplicate Removal
Model Training Efficiency
Innovation

Methods, ideas, or system contributions that make the work stand out.

FED
GPU acceleration
Hash function optimization