CATGNN: Cost-Efficient and Scalable Distributed Training for Graph Neural Networks

πŸ“… 2024-04-02
πŸ›οΈ arXiv.org
πŸ“ˆ Citations: 1
✨ Influential: 0
πŸ€– AI Summary
Existing distributed GNN training frameworks (e.g., DistDGL, PyG) incur prohibitive memory overhead on billion-scale graphs, rendering them infeasible on commodity multi-GPU workstations. To address this, we propose a streaming edge-input graph partitioning paradigm and introduce SPRING, a novel streaming partitioning algorithm that loads and partitions edge data on demand, drastically reducing cross-device data replication. By jointly optimizing streaming graph processing, memory-aware scheduling, and distributed training, the approach achieves significant efficiency gains. Evaluated on 16 public datasets, it trains GNNs on the largest publicly available graph under a memory budget where prior methods fail, and cuts the average replication factor by 50% relative to state-of-the-art partitioners. The result is a low-cost, highly scalable solution for distributed GNN training.

πŸ“ Abstract
Graph neural networks (GNNs) have proven successful in recent years. While various GNN architectures and training systems have been developed, training GNNs on large-scale real-world graphs remains challenging. Existing distributed systems load the entire graph into memory for graph partitioning, requiring huge memory to process large graphs and thus hindering GNN training on such graphs with commodity workstations. In this paper, we propose CATGNN, a cost-efficient and scalable distributed GNN training system that scales GNN training to billion-scale or larger graphs under limited computational resources. Among other features, it takes a stream of edges as input, instead of loading the entire graph into memory, for partitioning. We also propose a novel streaming partitioning algorithm named SPRING for distributed GNN training. We verify the correctness and effectiveness of CATGNN with SPRING on 16 open datasets. In particular, we demonstrate that CATGNN can handle the largest publicly available dataset with limited memory, which would otherwise be infeasible without increasing the memory capacity. SPRING also outperforms state-of-the-art partitioning algorithms significantly, with a 50% reduction in replication factor on average.
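The core idea of edge-streaming partitioning can be sketched in a few lines. The sketch below is a generic greedy streaming vertex-cut partitioner, not the actual SPRING algorithm (whose details are in the paper); the function name, scoring rule, and balance weight are illustrative assumptions. It shows why streaming helps: each edge is assigned as it arrives, so the full edge list never needs to reside in memory.

```python
from collections import defaultdict

def stream_partition(edge_stream, k, balance_weight=2.0):
    """Greedy streaming vertex-cut partitioner (illustrative sketch,
    not the paper's SPRING algorithm).

    Edges arrive one at a time, so the full graph never needs to fit
    in memory. Each edge goes to the partition that already replicates
    its endpoints, softened by a balance term that steers edges away
    from overloaded partitions.
    """
    replicas = defaultdict(set)   # vertex -> partitions holding a copy
    load = [0] * k                # edges assigned to each partition

    for u, v in edge_stream:
        hi, lo = max(load), min(load)

        def score(p):
            # Prefer partitions that already hold u or v (fewer new
            # replicas), plus a normalized load-balance bonus.
            rep = (p in replicas[u]) + (p in replicas[v])
            bal = (hi - load[p]) / (1 + hi - lo)
            return rep + balance_weight * bal

        best = max(range(k), key=score)
        replicas[u].add(best)
        replicas[v].add(best)
        load[best] += 1
    return replicas, load
```

A partition scheme like this keeps only per-vertex replica sets and per-partition counters in memory, which is far smaller than the edge list itself on billion-scale graphs.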
Problem

Research questions and friction points this paper is trying to address.

Reduces memory requirements for distributed GNN training
Enables training on large graphs with limited GPU memory
Improves graph partitioning quality for better performance
Innovation

Methods, ideas, or system contributions that make the work stand out.

Streaming-based edge input reduces memory for partitioning
Enables training when GPU memory is smaller than graph data
SPRING algorithm improves streaming partitioning quality for GNNs
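The partitioning quality above is typically measured by the replication factor: in a vertex-cut partitioning, a vertex may be copied to several partitions, and the replication factor is the average number of copies per vertex (1.0 means no replication; lower means less cross-device communication and memory). A minimal sketch of the metric, assuming a vertex-to-partition-set mapping like the one a vertex-cut partitioner would produce:

```python
def replication_factor(replicas):
    """Average number of partitions holding a copy of each vertex.

    `replicas` maps vertex -> set of partition ids (a vertex-cut
    assignment). A factor of 1.0 means no vertex is replicated.
    """
    return sum(len(parts) for parts in replicas.values()) / len(replicas)

# Example: vertex 1 is replicated on two partitions, vertices 0 and 2
# live on one partition each, so the factor is (1 + 2 + 1) / 3.
replication_factor({0: {0}, 1: {0, 1}, 2: {1}})
```

The paper's reported 50% reduction is in exactly this quantity, averaged over its benchmark datasets.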