CaPGNN: Optimizing Parallel Graph Neural Network Training with Joint Caching and Resource-Aware Graph Partitioning

📅 2025-08-19
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address high communication overhead and GPU load imbalance in full-batch Graph Neural Network (GNN) training on single-server multi-GPU systems, this paper proposes a joint caching and resource-aware graph partitioning framework. The method integrates an adaptive feature caching mechanism with a dynamic graph partitioning strategy tailored to GPU heterogeneity, jointly optimizing CPU-GPU memory hierarchy utilization and computational resource allocation. It achieves co-optimization across three dimensions: subgraph size, feature reuse, and communication granularity. Extensive experiments on multiple large-scale graph datasets demonstrate that the approach reduces total communication volume by up to 96% and accelerates end-to-end training by up to 12.7× over state-of-the-art methods, significantly improving the scalability and hardware utilization of full-batch GNN training.
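The feature-reuse idea in the summary — keeping frequently requested remote vertex features resident in GPU memory and spilling overflow to CPU memory, so that only true misses pay an inter-GPU transfer — can be sketched as a two-tier cache. The class, LRU eviction policy, and `fetch_remote` callback below are illustrative assumptions, not CaPGNN's actual data structures or adaptive policy.

```python
from collections import OrderedDict

class TwoTierFeatureCache:
    """Illustrative two-tier (GPU + CPU) cache for remote vertex features.

    Hot features stay in the small "gpu" tier; evictions spill into the
    larger "cpu" tier; only a full miss triggers a simulated inter-GPU
    transfer. (Hypothetical sketch -- not the paper's implementation.)
    """

    def __init__(self, gpu_capacity, cpu_capacity, fetch_remote):
        self.gpu = OrderedDict()          # vertex_id -> feature (fast tier)
        self.cpu = OrderedDict()          # vertex_id -> feature (slow tier)
        self.gpu_capacity = gpu_capacity
        self.cpu_capacity = cpu_capacity
        self.fetch_remote = fetch_remote  # callback standing in for a transfer
        self.transfers = 0                # remote fetches actually performed

    def get(self, vid):
        if vid in self.gpu:               # GPU hit: refresh LRU order
            self.gpu.move_to_end(vid)
            return self.gpu[vid]
        if vid in self.cpu:               # CPU hit: promote to the GPU tier
            feat = self.cpu.pop(vid)
        else:                             # full miss: pay the communication cost
            feat = self.fetch_remote(vid)
            self.transfers += 1
        self._put_gpu(vid, feat)
        return feat

    def _put_gpu(self, vid, feat):
        self.gpu[vid] = feat
        if len(self.gpu) > self.gpu_capacity:
            old_vid, old_feat = self.gpu.popitem(last=False)  # evict LRU entry
            self.cpu[old_vid] = old_feat                      # spill to CPU tier
            if len(self.cpu) > self.cpu_capacity:
                self.cpu.popitem(last=False)                  # drop oldest spill
```

With a working set that fits in the two tiers, repeated accesses hit the cache and the transfer count stays at the number of distinct vertices — the redundant retransmissions the paper targets simply never happen.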

📝 Abstract
Graph Neural Networks (GNNs) have shown remarkable capabilities in processing the graph-structured data prevalent in many real-world applications. However, the scalability of full-batch GNN training is severely limited by high communication overhead and load imbalance in distributed environments. In this paper, we present CaPGNN, a novel framework for efficient parallel full-batch GNN training on a single server with multiple GPUs, designed specifically to reduce redundant inter-GPU communication and balance computational workloads. We propose a joint adaptive caching algorithm that leverages both CPU and GPU memory to significantly reduce the repetitive transmission of vertex features across partitions. Additionally, we introduce a resource-aware graph partitioning algorithm that adjusts subgraph sizes dynamically according to the heterogeneous computational and communication capacities of GPUs. Extensive experiments on large-scale benchmark datasets demonstrate that CaPGNN reduces communication costs by up to 96% and accelerates GNN training by up to 12.7 times compared to state-of-the-art approaches. Our results highlight the potential of adaptive caching and resource-aware partitioning to enable scalable, efficient, and practical deployment of full-batch GNN training in distributed computing environments.
Problem

Research questions and friction points this paper is trying to address.

Reducing communication overhead in distributed GNN training
Balancing computational workloads across multiple GPUs
Optimizing resource usage with adaptive caching and partitioning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Joint adaptive caching algorithm using CPU and GPU memory
Resource-aware graph partitioning adjusting subgraph sizes dynamically
Reduces inter-GPU communication and balances computational workloads
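The second innovation above — sizing each GPU's subgraph according to its capacity rather than splitting the graph evenly — can be sketched as proportional allocation. The throughput scores and largest-remainder rounding here are hypothetical; the paper's actual cost model accounts for both compute and communication capacities and adjusts sizes dynamically.

```python
def proportional_partition_sizes(num_vertices, gpu_throughputs):
    """Split num_vertices into per-GPU targets proportional to capacity.

    gpu_throughputs: one relative capacity score per GPU (hypothetical
    scoring -- a stand-in for a fuller compute/communication cost model).
    Uses largest-remainder rounding so the sizes sum exactly to
    num_vertices.
    """
    total = sum(gpu_throughputs)
    raw = [num_vertices * t / total for t in gpu_throughputs]
    sizes = [int(r) for r in raw]                 # round everything down first
    leftover = num_vertices - sum(sizes)
    # hand the remaining vertices to the partitions rounded down the most
    order = sorted(range(len(raw)),
                   key=lambda i: raw[i] - sizes[i], reverse=True)
    for i in order[:leftover]:
        sizes[i] += 1
    return sizes
```

For example, a server with two identical GPUs and one twice as fast would receive partitions of 25%, 25%, and 50% of the vertices, so all three finish their per-layer computation at roughly the same time instead of idling on the slowest partition.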