FED: Fast and Efficient Dataset Deduplication Framework with GPU Acceleration

📅 2025-01-02

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

To address the low efficiency of training-data deduplication for large language models, this paper proposes FED, a MinHash Locality-Sensitive Hashing (LSH) framework optimized for GPU clusters. Methodologically, it employs lightweight, partially reusable non-cryptographic hash functions (e.g., xxHash) to reduce redundant computation, and integrates CUDA kernel optimization, multi-node GPU communication, and distributed hash tables to accelerate the pipeline end to end on GPUs. Its core contribution is a MinHash LSH architecture designed for large-scale GPU clusters that preserves deduplication accuracy while delivering high throughput. In experiments on 1 million documents with a single four-GPU node, FED is up to 58.3× faster than the CPU-based deduplication tool in SlimPajama and up to 8.6× faster than the GPU-based tool in NVIDIA NeMo Curator. Deduplicating a 1.2-trillion-token corpus takes 5.1 hours on a four-node, 16-GPU cluster.
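For readers unfamiliar with MinHash LSH, the general technique can be sketched on the CPU in a few lines of Python. This is a minimal illustration of the textbook algorithm, not FED's implementation: the function names, the 5-character shingles, the band and row counts, and the use of `zlib.crc32` as a stand-in non-cryptographic hash are all assumptions made here. Deriving every MinHash function from a single base hash per shingle is a common way to reuse hash work and may differ from the partially reusable hashing scheme the paper describes.

```python
import random
import zlib
from collections import defaultdict

# Large prime modulus for the affine hash family (a * h + b) mod p.
MERSENNE_PRIME = (1 << 61) - 1


def shingles(text, n=5):
    """Return the set of character n-grams of whitespace-normalized text."""
    text = " ".join(text.split())
    if len(text) <= n:
        return {text}
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def make_params(num_hashes, seed=0):
    """Draw (a, b) coefficients, one pair per MinHash function."""
    rng = random.Random(seed)
    return [
        (rng.randrange(1, MERSENNE_PRIME), rng.randrange(0, MERSENNE_PRIME))
        for _ in range(num_hashes)
    ]


def minhash_signature(text, params, n=5):
    """Compute a MinHash signature for one document.

    Each shingle is hashed once with a cheap non-cryptographic hash
    (CRC32 here, as an assumption); all MinHash functions reuse that value.
    """
    base = [zlib.crc32(s.encode("utf-8")) for s in shingles(text, n)]
    return [min((a * h + b) % MERSENNE_PRIME for h in base) for a, b in params]


def lsh_candidate_pairs(signatures, bands, rows):
    """Bucket signatures band by band and return colliding document pairs."""
    buckets = defaultdict(list)
    for doc_id, sig in signatures.items():
        for band in range(bands):
            key = (band, tuple(sig[band * rows:(band + 1) * rows]))
            buckets[key].append(doc_id)
    pairs = set()
    for ids in buckets.values():
        for i in range(len(ids)):
            for j in range(i + 1, len(ids)):
                pairs.add(tuple(sorted((ids[i], ids[j]))))
    return pairs


if __name__ == "__main__":
    docs = {
        "a": "Dataset deduplication improves the quality of training data.",
        "b": "Dataset deduplication improves the quality of training data!",
        "c": "GPU clusters need fast interconnects between nodes.",
    }
    params = make_params(num_hashes=128)
    sigs = {doc_id: minhash_signature(text, params) for doc_id, text in docs.items()}
    # Candidate pairs still need a Jaccard or signature check before removal.
    print(lsh_candidate_pairs(sigs, bands=32, rows=4))
```

With 32 bands of 4 rows, documents that share most of their shingles are very likely to collide in at least one band, while unrelated documents rarely do. Candidate pairs are only a shortlist: a real pipeline would verify similarity and then group duplicates before deleting anything.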

📝 Abstract

Dataset deduplication plays a crucial role in enhancing data quality, ultimately improving training performance and efficiency of LLMs. A commonly used method for data deduplication is the MinHash LSH algorithm. Recently, NVIDIA introduced a GPU-based MinHash LSH deduplication method, but it remains suboptimal, leaving room for further improvement in processing efficiency. This paper proposes FED, a GPU-accelerated deduplication framework that optimizes MinHash LSH for GPU clusters and leverages computationally efficient and partially reusable non-cryptographic hash functions. FED outperforms the CPU-based deduplication tool included in SlimPajama by up to 58.3 times and the GPU-based deduplication tool included in NVIDIA NeMo Curator by up to 8.6 times when processing 1 million documents with a node of four GPUs. Deduplication of 1.2 trillion tokens is completed in just 5.1 hours in a four-node, 16-GPU environment. The related code is publicly available on GitHub (https://github.com/mcrl/FED).
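To put the large-scale figure in perspective, the reported numbers can be turned into an average throughput. The short calculation below uses only the values stated in the abstract (1.2 trillion tokens, 5.1 hours, 16 GPUs) and assumes the 5.1 hours covers the whole run; the per-GPU figure is a simple average, not a measured per-device rate.

```python
TOKENS = 1.2e12   # corpus size reported in the abstract
HOURS = 5.1       # reported wall-clock time
GPUS = 16         # four nodes with four GPUs each


def throughput(tokens, hours, gpus):
    """Return (tokens per second overall, tokens per second per GPU)."""
    seconds = hours * 3600
    total = tokens / seconds
    return total, total / gpus


total, per_gpu = throughput(TOKENS, HOURS, GPUS)
print(f"{total / 1e6:.1f} M tokens/s overall, {per_gpu / 1e6:.2f} M tokens/s per GPU")
# prints "65.4 M tokens/s overall, 4.08 M tokens/s per GPU"
```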

Problem

Research questions and friction points this paper is trying to address.

Big Data

Duplicate Removal

Model Training Efficiency

Innovation

Methods, ideas, or system contributions that make the work stand out.

FED

GPU acceleration

Hash function optimization
