Communication-Avoiding SpGEMM via Trident Partitioning on Hierarchical GPU Interconnects

📅 2026-03-22

🤖 AI Summary

This work addresses the challenges of sparse general matrix-matrix multiplication (SpGEMM) on heterogeneous supercomputing systems, where unstructured sparsity and hierarchical GPU interconnects lead to poor memory efficiency, high communication overhead, and limited scalability. The authors propose Trident, a hierarchy-aware 2D distributed SpGEMM algorithm built on a novel trident partitioning scheme and combined with communication-avoiding techniques and asynchronous communication. Trident explicitly exploits the gap between high intra-node bandwidth and lower inter-node bandwidth to decide how data is distributed and to keep more communication within a node. Experiments on unstructured matrices show up to 2.38× speedup (1.54× geometric mean) over a conventional 2D SpGEMM, up to 2× lower inter-node communication volume on NERSC's Perlmutter supercomputer, and up to 2× speedup in Markov Clustering compared to competing strategies.

📝 Abstract

The multiplication of two sparse matrices, known as SpGEMM, is a key kernel in scientific computing and large-scale data analytics. It underpins graph algorithms, machine learning, simulations, and computational biology, where sparsity is often highly unstructured. Unstructured sparsity makes high performance difficult to achieve because it limits both memory efficiency and scalability. In distributed memory, the cost of exchanging and merging partial products across nodes further constrains performance. These issues are exacerbated on modern heterogeneous supercomputers with deep, hierarchical GPU interconnects. Current SpGEMM implementations overlook the gap between intra-node and inter-node bandwidth, which results in unnecessary data movement and synchronization and leaves the fast intra-node interconnect underused. To address these challenges, we introduce Trident, a hierarchy-aware 2D distributed SpGEMM algorithm that uses communication-avoiding techniques and asynchronous communication to exploit the hierarchical and heterogeneous architecture of modern supercomputing interconnects. Central to Trident is the novel trident partitioning scheme, which enables hierarchy-aware decomposition and reduces inter-node communication by leveraging the higher bandwidth between GPUs within a node than between GPUs on different nodes. Here, we evaluate Trident on unstructured matrices, achieving up to $2.38\times$ speedup over a 2D SpGEMM, with a geometric mean speedup of $1.54\times$. Trident reduces inter-node communication volume by up to $2\times$ on NERSC's Perlmutter supercomputer. Furthermore, we demonstrate the effectiveness of Trident in speeding up Markov Clustering, achieving up to $2\times$ speedup compared to competing strategies.
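To make the intra-node versus inter-node distinction concrete, the following toy sketch counts how many block transfers of a plain 2D SUMMA-style multiplication stay within a node and how many cross nodes, for a given placement of GPUs on nodes. It is not the trident partitioning scheme, whose details the abstract does not give. The function `classify_summa_transfers`, the two placements `row_major` and `tiled`, the 4×4 process grid, and the 4 GPUs per node are all assumptions chosen for illustration. The model also assumes point-to-point transfers of equally sized blocks, which ignores tree-based broadcasts and the uneven block sizes that unstructured sparsity produces.

```python
from collections import Counter

Q = 4               # hypothetical q x q process grid (16 GPUs)
GPUS_PER_NODE = 4   # assumed node size, for illustration only


def classify_summa_transfers(q, node_of):
    """Count intra-node and inter-node block transfers in a 2D SUMMA-style
    multiplication on a q x q grid. node_of(i, j) returns the node that
    hosts grid position (i, j). Every block counts as one transfer."""
    counts = Counter(intra=0, inter=0)
    for k in range(q):              # SUMMA stage
        for i in range(q):
            for j in range(q):
                dst = node_of(i, j)
                if k != j:          # A(i, k) travels along grid row i
                    kind = "intra" if node_of(i, k) == dst else "inter"
                    counts[kind] += 1
                if k != i:          # B(k, j) travels along grid column j
                    kind = "intra" if node_of(k, j) == dst else "inter"
                    counts[kind] += 1
    return dict(counts)


def row_major(i, j):
    """Each grid row is placed on one node."""
    return (i * Q + j) // GPUS_PER_NODE


def tiled(i, j):
    """Each 2 x 2 tile of the grid is placed on one node."""
    return (i // 2) * (Q // 2) + (j // 2)


if __name__ == "__main__":
    for name, placement in (("row_major", row_major), ("tiled", tiled)):
        print(name, classify_summa_transfers(Q, placement))
```

Under these assumptions the two placements split the same 96 transfers differently between the fast and slow links, which is the kind of effect a hierarchy-aware decomposition is meant to control. A more faithful model would weight each transfer by block size and by the measured bandwidth of each link.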

Problem

Research questions and friction points this paper is trying to address.

SpGEMM

communication-avoiding

hierarchical interconnects

sparse matrix multiplication

distributed memory

Innovation

Methods, ideas, or system contributions that make the work stand out.

Communication-Avoiding

Trident Partitioning

Hierarchical GPU Interconnects