🤖 AI Summary
Training ultra-large GPT models on a single node faces critical challenges: GPU memory constraints, low hardware utilization, and labor-intensive manual scheduling.
Method: This paper proposes an AI-driven, task-level heterogeneous parallel framework built on the StarPU runtime system. It unifies CPU/GPU task orchestration, constructs task graphs dynamically, manages heterogeneous memory hierarchies, and supports fine-grained tensor tiling, automating the joint optimization of data placement, computation assignment, and communication scheduling while eliminating manual device binding.
Contribution/Results: The framework successfully trains GPT models ranging from billions to tens of billions of parameters on a single node, reducing GPU memory consumption by 40% and improving combined GPU/CPU utilization by 2.3×. It demonstrates significantly better scalability than PyTorch FSDP, offering a viable pathway toward single-node training of trillion-parameter models.
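The fine-grained tensor tiling mentioned above can be illustrated with a minimal sketch: a large matrix product is split into independent per-tile tasks, each of which a task-based runtime could then place on any CPU core or GPU. This is a hypothetical illustration in Python, not NNTile's actual implementation (which is built on StarPU in C++).

```python
import numpy as np

def tiled_matmul(A, B, tile=2):
    """Split C = A @ B into per-tile update tasks.

    Each (i, j, p) triple below is one independent fine-grained task
    over tile-sized blocks; a runtime scheduler is free to execute
    these tasks on different processing units.  Illustrative only.
    """
    m, k = A.shape
    k2, n = B.shape
    assert k == k2, "inner dimensions must match"
    C = np.zeros((m, n))
    # Build the task list first (this plays the role of a task graph).
    tasks = [(i, j, p)
             for i in range(0, m, tile)
             for j in range(0, n, tile)
             for p in range(0, k, tile)]
    # Execute the tasks; here sequentially, but tasks touching disjoint
    # C tiles (different (i, j)) could run concurrently.
    for i, j, p in tasks:
        C[i:i+tile, j:j+tile] += (
            A[i:i+tile, p:p+tile] @ B[p:p+tile, j:j+tile]
        )
    return C, len(tasks)
```

Shrinking the tile size increases the number of schedulable tasks (and scheduling flexibility) at the cost of per-task overhead, which is the trade-off such frameworks tune.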
📝 Abstract
This study presents NNTile, a framework for training large deep neural networks on heterogeneous clusters. NNTile is based on the StarPU library, which implements task-based parallelism and schedules all submitted tasks onto the available processing units (CPUs and GPUs). This means that any particular operation required to train a large neural network can be performed on any of the CPU cores or GPU devices, depending on automatic scheduling decisions. Such an approach shifts the burden of deciding where to compute and when to communicate from a human being to an automatic decision maker, whether a simple greedy heuristic or complex AI-based software. The performance of the presented tool for training large language models is demonstrated in extensive numerical experiments.
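The "simple greedy heuristic" decision maker mentioned above can be sketched as follows: each task is assigned to whichever worker (CPU core or GPU) would finish it earliest, given that worker's relative speed and current load. This is a hedged toy model of the scheduling decision StarPU automates; the worker names and cost model are our own assumptions, not StarPU's API.

```python
def greedy_schedule(tasks, workers):
    """Greedy earliest-finish-time placement.

    tasks:   list of (name, cost) pairs, cost in abstract work units.
    workers: dict mapping worker name -> relative speed (higher = faster).
    Returns (placement, finish_times).  Hypothetical sketch only.
    """
    finish = {w: 0.0 for w in workers}   # current completion time per worker
    placement = {}
    for name, cost in tasks:
        # Pick the worker whose queue would complete this task soonest.
        best = min(workers, key=lambda w: finish[w] + cost / workers[w])
        finish[best] += cost / workers[best]
        placement[name] = best
    return placement, finish

# Example: a GPU 4x faster than a CPU core absorbs the big tasks,
# while the CPU picks up a small task once the GPU queue grows.
tasks = [("matmul_1", 4.0), ("matmul_2", 4.0), ("bias_add", 2.0)]
workers = {"cpu0": 1.0, "gpu0": 4.0}
placement, finish = greedy_schedule(tasks, workers)
```

In this example the heuristic sends both large matrix products to `gpu0` and offloads `bias_add` to `cpu0`, balancing completion times without any manual device binding.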