🤖 AI Summary
Graph Convolutional Networks (GCNs) run inference inefficiently on large-scale sparse graphs, because power-law degree distributions cause irregular memory access patterns and poor data locality. To address this, the paper proposes GCoD, an algorithm-hardware co-design framework. Methodologically, GCoD introduces (1) a novel graph-local polarization partitioning algorithm that hierarchically decomposes the adjacency matrix into high- and low-density substructures, and (2) a density-aware dual-mode accelerator featuring separate execution paths for dense and sparse subgraphs, integrated with on-chip dataflow optimization and memory-access compression. Evaluated on real-world graph datasets, GCoD achieves speedups of up to 15,286×, 294×, 7.8×, and 2.5× over CPU, GPU, HyGCN, and AWB-GCN baselines, respectively, reduces off-chip memory accesses, and maintains or even improves model accuracy.
📝 Abstract
Graph Convolutional Networks (GCNs) have emerged as state-of-the-art graph learning models. However, running GCN inference over large graph datasets can be notoriously challenging, limiting their application to large real-world graphs and hindering the exploration of deeper and more sophisticated GCN graphs. This is because real-world graphs can be extremely large and sparse. Furthermore, the node degrees of these graphs tend to follow a power-law distribution, which yields highly irregular adjacency matrices. This irregularity causes prohibitive inefficiencies in both data processing and data movement, substantially limiting the achievable GCN acceleration efficiency. To this end, this paper proposes a GCN algorithm and accelerator Co-Design framework, dubbed GCoD, which can largely alleviate the aforementioned GCN irregularity and boost GCNs’ inference efficiency. Specifically, on the algorithm level, GCoD integrates a split-and-conquer GCN training strategy that polarizes the graphs to be either denser or sparser in local neighborhoods without compromising model accuracy. The resulting graph adjacency matrices (mostly) have merely two levels of workload and enjoy largely enhanced regularity, and are thus easier to accelerate. On the hardware level, we further develop a dedicated two-pronged accelerator with a separate engine for each of the aforementioned denser and sparser workloads, further boosting the overall utilization and acceleration efficiency. Extensive experiments and ablation studies validate that GCoD consistently reduces the number of off-chip accesses, leading to speedups of 15286×, 294×, 7.8×, and 2.5× as compared to CPUs, GPUs, and the prior-art GCN accelerators HyGCN and AWB-GCN, respectively, while maintaining or even improving the task accuracy. Additionally, we visualize GCoD-trained graph adjacency matrices for a better understanding of GCoD's advantages.
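To make the "two levels of workload" idea concrete, the following is a minimal illustrative sketch, not GCoD's actual training or partitioning procedure. It assumes a toy graph whose edges are split by a hypothetical fixed `block_size`: edges that fall inside a diagonal block of the adjacency matrix stand in for the denser workload, and the remaining edges stand in for the sparser one. The helper names (`split_by_blocks`, `aggregate`, `add_rows`) are made up for this example. The sketch only shows that neighbor aggregation can be computed separately on the two parts and summed, which is what would let separate engines handle them.

```python
def split_by_blocks(edges, block_size):
    """Split edges into those inside diagonal blocks of the adjacency
    matrix (denser workload) and the off-block remainder (sparser workload).
    The fixed block_size is an assumption made for illustration only."""
    dense, sparse = [], []
    for u, v in edges:
        if u // block_size == v // block_size:
            dense.append((u, v))
        else:
            sparse.append((u, v))
    return dense, sparse


def aggregate(edges, features, num_nodes):
    """Unnormalized neighbor aggregation: out[u] += features[v] per edge (u, v)."""
    dim = len(features[0])
    out = [[0.0] * dim for _ in range(num_nodes)]
    for u, v in edges:
        for k in range(dim):
            out[u][k] += features[v][k]
    return out


def add_rows(a, b):
    """Element-wise sum of two equally shaped lists of rows."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


# Toy graph: two tightly connected groups plus one cross-group link.
num_nodes = 8
edges = [
    (0, 1), (1, 0), (0, 2), (2, 0), (1, 2), (2, 1),  # group {0, 1, 2}
    (4, 5), (5, 4), (6, 7), (7, 6),                  # inside block {4..7}
    (3, 6), (6, 3),                                  # cross-block edges
]
features = [[float(i), 1.0] for i in range(num_nodes)]

dense_edges, sparse_edges = split_by_blocks(edges, block_size=4)

# Aggregating over the full graph equals the sum of the two partial results.
full = aggregate(edges, features, num_nodes)
combined = add_rows(
    aggregate(dense_edges, features, num_nodes),
    aggregate(sparse_edges, features, num_nodes),
)
assert full == combined
print(len(dense_edges), len(sparse_edges))
```

In this toy setting most edges land in the diagonal blocks, so the denser part carries the bulk of the work while the sparser part stays small. The abstract describes GCoD's training strategy as pushing real graphs toward a similar two-level structure.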