🤖 AI Summary
This work presents a systematic evaluation and optimization of the Cerebras CS-3 platform for sparse linear algebra, focusing on sparse-dense matrix multiplication (SpMM) and sampled dense-dense matrix multiplication (SDDMM). Low-level sparse kernels are designed for the CS-3’s dataflow architecture and optimized for I/O performance, memory footprint, and scalability to large sparse matrices. Experimental results suggest that at 90% sparsity, the CS-3 achieves speedups of up to 100× and 20× over CPU baselines for SpMM and SDDMM, respectively, with SpMM performance improving as the sparse matrix dimensions grow. However, beyond 99% sparsity, SpMM performance degrades to the point that the CS-3 is slower than the CPU. The study addresses a gap in understanding the CS-3’s capabilities for sparse computation, showing its potential in moderately sparse regimes and identifying a performance bottleneck under extreme sparsity.
📝 Abstract
In recent years, novel AI accelerators have emerged as promising alternatives to GPUs for AI model training and inference. One such accelerator, the Cerebras CS-3, achieves strong performance on large model training as well as on scientific applications such as molecular dynamics simulations. While dense compute workloads have been thoroughly explored on the CS-3, its potential for sparse workloads has not been fully examined. Applications that rely on sparse linear algebra kernels, such as GNNs, linear solvers, and recommendation systems, could achieve good performance on a dataflow accelerator like the CS-3.
In this work, we explore two key sparse linear algebra kernels, sparse-dense matrix multiplication (SpMM) and sampled dense-dense matrix multiplication (SDDMM), on the Cerebras CS-3. We propose low-level CS-3 kernel designs for these operations and optimize them to improve I/O performance, memory footprint, and scalability to large matrices. Our evaluation examines memory footprint and SpMM/SDDMM speedup relative to a CPU. The results suggest that the CS-3 can outperform the CPU by 100× for SpMM on 90% sparse matrices, with performance improving as the sparse matrix dimensions increase. SDDMM on the CS-3 can outperform the CPU by 20× on 90% sparse matrices. We additionally find that as sparsity increases beyond 99%, the CS-3 suffers performance degradation that makes it slower than the CPU for SpMM.
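For readers unfamiliar with the two kernels, the following is a minimal pure-Python sketch of what SpMM and SDDMM compute. It is a reference definition only and not the CS-3 kernel design from the paper. The CSR storage layout, the function names `spmm` and `sddmm`, and the convention that SDDMM scales each sampled dot product by the stored sparse value are all assumptions made for illustration; some formulations of SDDMM use the sparse matrix purely as a sampling mask.

```python
# Reference definitions of SpMM and SDDMM on a CSR sparse matrix.
# Illustrative only: names and storage layout are assumptions, not the paper's code.

def spmm(indptr, indices, data, B):
    """Return C = A @ B, where A is sparse (CSR arrays) and B is a dense list of rows."""
    n_rows = len(indptr) - 1
    n_cols = len(B[0])
    C = [[0.0] * n_cols for _ in range(n_rows)]
    for i in range(n_rows):
        # Visit only the stored nonzeros of row i of A.
        for p in range(indptr[i], indptr[i + 1]):
            k, a_ik = indices[p], data[p]
            for j in range(n_cols):
                C[i][j] += a_ik * B[k][j]
    return C


def sddmm(indptr, indices, data, X, Y):
    """Return the values of S * (X @ Y^T) at the nonzero positions of sparse S (CSR).

    X has one row per row of S; Y has one row per column of S.
    The output is aligned with `data`, so it shares S's sparsity pattern.
    """
    out = [0.0] * len(data)
    for i in range(len(indptr) - 1):
        for p in range(indptr[i], indptr[i + 1]):
            j = indices[p]
            # Dense dot product is computed only where S has a stored entry.
            dot = sum(x * y for x, y in zip(X[i], Y[j]))
            out[p] = data[p] * dot
    return out


# A = [[1, 0, 2],
#      [0, 0, 3]] in CSR form
indptr, indices, data = [0, 2, 3], [0, 2, 2], [1.0, 2.0, 3.0]

B = [[1, 2], [3, 4], [5, 6]]
spmm_result = spmm(indptr, indices, data, B)

X = [[1, 2], [3, 4]]
Y = [[1, 0], [0, 1], [1, 1]]
sddmm_result = sddmm(indptr, indices, data, X, Y)
```

In both kernels the work is proportional to the number of stored nonzeros rather than to the full matrix size, which is why the achievable speedup depends on the sparsity level.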