Democratizing AI: Open-source Scalable LLM Training on GPU-based Supercomputers

📅 2024-11-17
🏛️ International Conference for High Performance Computing, Networking, Storage and Analysis
🤖 AI Summary
This work addresses two challenges in open-source training of large language models (LLMs) on GPU-based supercomputers: system scalability and privacy leakage through memorization of training data. The authors present a four-dimensional hybrid parallel algorithm, combined with bf16 matrix-multiply optimizations, overlap of non-blocking collectives with computation, and performance-model-guided configuration search, implemented in AxoNN, a portable open-source framework that scales across heterogeneous exascale systems (Perlmutter, Frontier, Alps). They also characterize "catastrophic memorization"—a privacy risk in which sufficiently large LLMs memorize training data in a single pass—and present a training approach to prevent it. Experiments reach 620.1 Petaflop/s on Perlmutter, 1.381 Exaflop/s on Frontier, and 1.423 Exaflop/s on Alps (bf16) for GPT-style transformer training, and demonstrate fine-tuning of a 405-billion-parameter model on Frontier.
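The performance-model-guided configuration search mentioned above can be illustrated with a small sketch: enumerate the ways to split the GPU count across the four parallel dimensions, discard configurations whose model shards do not fit in GPU memory, and pick the one a cost model predicts to be cheapest. Everything here is an assumption for illustration—the decomposition into three tensor-parallel dimensions plus one data-parallel dimension, and the weights in `comm_cost`, are toy stand-ins, not the paper's actual performance model.

```python
def factorizations(n, dims=4):
    """All ways to split n GPUs into an ordered product of `dims` group sizes."""
    if dims == 1:
        return [(n,)]
    out = []
    for d in range(1, n + 1):
        if n % d == 0:
            out.extend((d,) + rest for rest in factorizations(n // d, dims - 1))
    return out

def comm_cost(cfg):
    """Toy cost model (illustrative weights): tensor-parallel groups communicate
    every layer, so they are weighted more heavily than the data-parallel
    dimension, which communicates once per optimizer step."""
    gx, gy, gz, d = cfg
    return 2.0 * ((gx - 1) + (gy - 1) + (gz - 1)) + 0.5 * (d - 1)

def best_config(n_gpus, model_size, mem_per_gpu):
    """Pick the cheapest 4D configuration whose model shards fit in GPU memory."""
    feasible = [
        cfg for cfg in factorizations(n_gpus)
        if model_size / (cfg[0] * cfg[1] * cfg[2]) <= mem_per_gpu
    ]
    return min(feasible, key=comm_cost)
```

For example, `best_config(16, 8, 2)` must shard the 8-unit model at least 4 ways across the tensor dimensions to fit in 2 units of memory per GPU, and the toy model then favors spending the remaining GPUs on data parallelism.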

📝 Abstract
Training and fine-tuning large language models (LLMs) with hundreds of billions to trillions of parameters requires tens of thousands of GPUs and a highly scalable software stack. In this work, we present a novel four-dimensional hybrid parallel algorithm implemented in a highly scalable, portable, open-source framework called AxoNN. We describe several performance optimizations in AxoNN: improving matrix multiply kernel performance, overlapping non-blocking collectives with computation, and using performance modeling to choose performance-optimal configurations. These have resulted in unprecedented scaling and peak flop/s (bf16) for training of GPT-style transformer models on Perlmutter (620.1 Petaflop/s), Frontier (1.381 Exaflop/s) and Alps (1.423 Exaflop/s). While the abilities of LLMs improve with the number of trainable parameters, so do privacy and copyright risks caused by memorization of training data, which can cause disclosure of sensitive or private information at inference time. We highlight this side effect of scale through experiments that explore "catastrophic memorization," where models are sufficiently large to memorize training data in a single pass, and present an approach to prevent it. As part of this study, we demonstrate fine-tuning of a 405-billion parameter LLM using AxoNN on Frontier.
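The overlap of non-blocking collectives with computation described in the abstract can be sketched in miniature: pre-post the "collective" for chunk i+1 on a background thread, then do the compute for chunk i while that communication is in flight. The functions `fake_all_reduce` and `fake_compute` are stand-ins invented for this sketch; a real implementation would use genuinely non-blocking collectives (e.g., MPI's Iallreduce or NCCL's asynchronous operations) rather than a thread pool.

```python
from concurrent.futures import ThreadPoolExecutor

def fake_all_reduce(chunk):
    # stand-in for a non-blocking collective; here it just sums the chunk
    return sum(chunk)

def fake_compute(value):
    # stand-in for the matrix-multiply work that hides the communication
    return value * 2

def overlapped(chunks):
    """Pre-post the collective for chunk i+1, then compute on chunk i,
    so communication and computation proceed concurrently."""
    if not chunks:
        return []
    results = []
    with ThreadPoolExecutor(max_workers=1) as comm:
        inflight = comm.submit(fake_all_reduce, chunks[0])
        for i in range(len(chunks)):
            # launch the next chunk's "communication" before computing
            nxt = comm.submit(fake_all_reduce, chunks[i + 1]) if i + 1 < len(chunks) else None
            reduced = inflight.result()            # wait for chunk i's collective
            results.append(fake_compute(reduced))  # overlaps with nxt's communication
            inflight = nxt
    return results
```

The double-buffering pattern here is generic: as long as each chunk's compute is independent of the next chunk's communication, the communication latency is hidden behind useful work.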
Problem

Research questions and friction points this paper is trying to address.

Scalable LLM training on GPU supercomputers
Optimizing performance with AxoNN framework
Addressing privacy risks in large-scale LLMs
Innovation

Methods, ideas, or system contributions that make the work stand out.

Open-source scalable framework AxoNN
Four-dimensional hybrid parallel algorithm
Performance optimizations for LLM training