Exploring Sparse Matrix Multiplication Kernels on the Cerebras CS-3

📅 2026-04-30

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work presents a systematic evaluation and optimization of the Cerebras CS-3 platform for sparse linear algebra, focusing on two kernels: sparse-dense matrix multiplication (SpMM) and sampled dense-dense matrix multiplication (SDDMM). The authors design low-level sparse kernels that fit the CS-3’s dataflow architecture and optimize them for I/O performance, memory footprint, and scalability to large matrices. In the evaluation, at 90% sparsity the CS-3 achieves speedups of 100× for SpMM and 20× for SDDMM over CPU baselines, with SpMM performance improving as the sparse matrix dimensions grow. Beyond 99% sparsity, however, performance degrades to the point that the CS-3 is slower than the CPU for SpMM. The study addresses a gap in understanding the CS-3’s capabilities for sparse computation: it shows substantial potential in moderately sparse regimes and identifies performance degradation under extreme sparsity.

📝 Abstract

In recent years, novel AI accelerators have emerged as promising alternatives to GPUs for AI model training and inference tasks. One such accelerator, the Cerebras CS-3, achieves strong performance on large model training as well as scientific applications like molecular dynamics simulations. While dense compute workloads have been thoroughly explored for the CS-3, its potential for sparse workloads has not been fully examined. Applications requiring sparse linear algebra kernels, such as GNNs, linear solvers, and recommendation systems, could achieve good performance on a dataflow accelerator like the CS-3. In this work, we explore two key sparse linear algebra kernels, sparse-dense matrix multiplication (SpMM) and sampled dense-dense matrix multiplication (SDDMM), on the Cerebras CS-3. We propose low-level CS-3 kernel designs for these operations and optimize our designs to improve I/O performance, memory footprint, and scalability to large matrices. Our evaluation examines memory footprint and SpMM/SDDMM speedup relative to CPU. The evaluation suggests that the CS-3 can outperform CPU by 100× for SpMM with 90% sparse matrices, with performance improving as sparse matrix dimensionality increases. SDDMM on the CS-3 can outperform CPU by 20× for 90% sparse matrices. We additionally find that as sparsity increases beyond 99%, the CS-3 suffers from performance degradation that makes it slower than CPU for SpMM.
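
For readers unfamiliar with the two kernels named in the abstract, the following is a minimal reference sketch of what SpMM and SDDMM compute. It is not the paper's CS-3 kernel design. It assumes the sparse operand is stored in CSR form, and the names `spmm_csr`, `sddmm_csr`, `indptr`, `indices`, and `data` are illustrative choices, not taken from the source.

```python
def spmm_csr(indptr, indices, data, B):
    """SpMM: C = A @ B, with A sparse in CSR form and B dense (list of rows)."""
    n_rows = len(indptr) - 1
    n_cols = len(B[0]) if B else 0
    C = [[0.0] * n_cols for _ in range(n_rows)]
    for i in range(n_rows):
        # Visit only the stored nonzeros of row i of A.
        for k in range(indptr[i], indptr[i + 1]):
            j, a_ij = indices[k], data[k]
            for c in range(n_cols):
                C[i][c] += a_ij * B[j][c]
    return C


def sddmm_csr(indptr, indices, data, X, Y):
    """SDDMM: S * (X @ Y^T), evaluated only at the stored positions of sparse S.

    X has one row per row of S and Y has one row per column of S.
    The result reuses S's sparsity pattern, so only the values are returned.
    """
    out = [0.0] * len(data)
    for i in range(len(indptr) - 1):
        for k in range(indptr[i], indptr[i + 1]):
            j = indices[k]
            # Dense dot product is computed only where S has a stored entry.
            out[k] = data[k] * sum(x * y for x, y in zip(X[i], Y[j]))
    return out


# Hypothetical 2x3 sparse matrix [[1, 0, 2], [0, 0, 3]] in CSR form.
indptr, indices, data = [0, 2, 3], [0, 2, 2], [1, 2, 3]

C = spmm_csr(indptr, indices, data, [[1, 2], [3, 4], [5, 6]])
vals = sddmm_csr(indptr, indices, data, [[1, 2], [3, 4]], [[1, 0], [0, 1], [1, 1]])
```

In both kernels the work scales with the number of stored nonzeros rather than with the full matrix size, so the amount of arithmetic per row shrinks as sparsity rises.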

Problem

Research questions and friction points this paper is trying to address.

sparse matrix multiplication

SpMM

SDDMM

AI accelerator

Cerebras CS-3

Innovation

Methods, ideas, or system contributions that make the work stand out.

Sparse Matrix Multiplication

Cerebras CS-3

SpMM