🤖 AI Summary
To address the inefficiencies of irregular memory access and high communication overhead in distributed full-batch GCN training on CPU-based HPC systems, this work proposes: (1) a generic aggregation operator tailored to graph-structural irregularity; (2) a novel “pre-post aggregation” paradigm that jointly optimizes pre-aggregation and post-aggregation scheduling; and (3) a synergistic communication compression mechanism integrating gradient/feature quantization with label propagation. The resulting HPC-grade CPU-distributed training framework achieves up to 6× speedup over state-of-the-art methods across multiple large-scale graph datasets, scales effectively to thousand-core CPU clusters, preserves convergence behavior and model accuracy, and significantly reduces power consumption and hardware cost compared to GPU-based alternatives.
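The graph aggregation these operators target can be sketched as a plain CSR-based neighbor sum (a generic illustration of the computational pattern, not the paper's optimized operator; all names here are hypothetical):

```python
import numpy as np

def aggregate_csr(indptr, indices, h):
    """Sum each vertex's neighbor feature rows over a CSR graph.

    The gathers through `indices` are the irregular, data-dependent
    memory accesses that specialized aggregation operators optimize.
    """
    out = np.zeros_like(h)
    for v in range(len(indptr) - 1):
        nbrs = indices[indptr[v]:indptr[v + 1]]  # neighbor IDs of v
        out[v] = h[nbrs].sum(axis=0)             # irregular gather + reduce
    return out

# Tiny example: undirected path graph 0-1-2 with 2-dim features.
indptr = np.array([0, 1, 3, 4])
indices = np.array([1, 0, 2, 1])
h = np.array([[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]])
print(aggregate_csr(indptr, indices, h))
# → [[0. 1.]
#    [3. 2.]
#    [0. 1.]]
```

In a naive form, each row of `h[nbrs]` lands on an unpredictable cache line, which is exactly why a generic, structure-aware operator pays off on CPUs.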
📝 Abstract
Graph Convolutional Networks (GCNs) are widely used across many domains. However, distributed full-batch training of GCNs on large-scale graphs is challenging due to inefficient memory access patterns and high communication overhead. This paper presents general, efficient aggregation operators designed for irregular memory access patterns. Additionally, we propose a pre-post-aggregation approach and a quantization-with-label-propagation method to reduce communication costs. Combining these techniques, we develop an efficient and scalable distributed GCN training framework, *SuperGCN*, for CPU-powered supercomputers. Experimental results on multiple large graph datasets show that our method achieves a speedup of up to 6× over state-of-the-art implementations and scales to thousands of HPC-grade CPUs without sacrificing model convergence or accuracy. Our framework delivers performance on CPU-powered supercomputers comparable to that of GPU-powered supercomputers, at a fraction of the cost and power budget.
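The quantization half of the communication-compression scheme can be illustrated with a minimal per-row uniform-quantization sketch (hypothetical names; the paper's actual method, which couples quantization with label propagation, is not reproduced here):

```python
import numpy as np

def quantize_features(x, bits=8):
    """Per-row uniform quantization of a float32 feature matrix.

    Illustrative only: real frameworks tune the bit width and combine
    quantization with further techniques such as label propagation.
    """
    qmax = (1 << bits) - 1
    lo = x.min(axis=1, keepdims=True)
    hi = x.max(axis=1, keepdims=True)
    scale = np.where(hi > lo, (hi - lo) / qmax, 1.0)  # avoid div-by-zero
    q = np.round((x - lo) / scale).astype(np.uint8)   # payload on the wire
    return q, lo.astype(np.float32), scale.astype(np.float32)

def dequantize_features(q, lo, scale):
    """Reconstruct approximate features on the receiving rank."""
    return q.astype(np.float32) * scale + lo

rng = np.random.default_rng(0)
x = rng.random((4, 16), dtype=np.float32)
q, lo, scale = quantize_features(x)
x_hat = dequantize_features(q, lo, scale)
# Rounding error is bounded by half a quantization step per element.
assert np.all(np.abs(x - x_hat) <= 0.5 * scale + 1e-5)
```

Sending `uint8` codes instead of `float32` values shrinks the feature payload roughly 4×, plus a small per-row `(lo, scale)` header, which is the kind of bandwidth saving that matters at thousand-core scale.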