Stencil Computations on Cerebras Wafer-Scale Engine

📅 2026-05-08

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the memory wall in traditional high-performance computing architectures, which keeps stencil computations, a staple of scientific simulation, memory-bound. The authors deploy stencil kernels on the AI-oriented Cerebras WSE-3 wafer-scale engine through CStencil, a framework that uses the system's distributed on-chip SRAM and high-bandwidth mesh interconnect within a dataflow programming model to implement two-dimensional stencil computations. Against a single-precision port of ConvStencil running on an NVIDIA A100 GPU, CStencil achieves speedups of up to 342x. A roofline analysis indicates that CStencil saturates the available compute and memory resources, suggesting that the WSE-3 is not limited to low-precision AI workloads and can also serve traditional scientific computing kernels.
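For readers unfamiliar with the kernel, a two-dimensional stencil updates each grid point from a fixed pattern of its neighbors. The following is a minimal pure-Python sketch of one sweep of a five-point stencil. It is an illustration only, not CStencil or ConvStencil code; the function name `stencil_step`, the weight ordering, and the choice to copy boundary cells unchanged are all assumptions made for this example.

```python
def stencil_step(grid, weights):
    """Apply one sweep of a five-point stencil to a 2D list of floats.

    weights = (center, north, south, west, east). Boundary cells are
    copied unchanged (an assumption made for this sketch).
    """
    rows, cols = len(grid), len(grid[0])
    center, north, south, west, east = weights
    out = [row[:] for row in grid]  # copy so reads always see the old values
    for i in range(1, rows - 1):
        for j in range(1, cols - 1):
            out[i][j] = (center * grid[i][j]
                         + north * grid[i - 1][j]
                         + south * grid[i + 1][j]
                         + west * grid[i][j - 1]
                         + east * grid[i][j + 1])
    return out


# A single unit spike in the middle of a 5x5 grid, averaged over its
# four neighbors (center weight 0).
grid = [[0.0] * 5 for _ in range(5)]
grid[2][2] = 1.0
result = stencil_step(grid, (0.0, 0.25, 0.25, 0.25, 0.25))
```

Each output value needs only a handful of arithmetic operations per value loaded, which is why kernels of this shape tend to be limited by memory traffic rather than by compute.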

📝 Abstract

Stencil computations are a fundamental kernel in scientific computing, critical for simulations in domains such as fluid dynamics and climate modeling. However, these computations are often memory-bound on traditional High-Performance Computing architectures like GPUs, struggling against the "Memory Wall". Simultaneously, the rise of AI-oriented hardware, such as the Cerebras Wafer-Scale Engine, offers massive core parallelism and high-bandwidth on-chip memory, though typically optimized for lower-precision workloads. This work investigates the viability of bridging this divergence by mapping stencil algorithms onto the Cerebras WSE-3. The study introduces CStencil, a novel framework designed to implement two-dimensional stencil computations on the WSE-3. To ensure a rigorous and fair performance evaluation, the research also adapts ConvStencil, a state-of-the-art GPU stencil solver, porting it from its original double-precision design to single-precision for execution on an NVIDIA A100 GPU. Experimental results show that the WSE-3's distributed SRAM and mesh interconnect effectively eliminate the off-chip memory bottlenecks common in GPU implementations. CStencil achieves speedups of up to 342x over the adapted ConvStencil version. A roofline model analysis further confirms that CStencil saturates the available compute and memory resources, demonstrating that the WSE dataflow architecture can be successfully repurposed for traditional scientific algorithms. These findings highlight the potential of the WSE-3 to deliver hardware utilization levels unattainable on conventional systems, offering a promising path toward overcoming the memory limitations of current HPC architectures.
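The roofline argument mentioned in the abstract can be sketched with a few lines of arithmetic. The example below assumes a generic five-point stencil with five coefficients (5 multiplies and 4 adds per point) and ideal data reuse in single precision (one 4-byte read and one 4-byte write per point). The peak throughput and bandwidth figures are hypothetical round numbers chosen for illustration; they are not measurements from the paper and not the specifications of the A100 or the WSE-3.

```python
def attainable_flops(peak_flops, mem_bandwidth, arithmetic_intensity):
    """Roofline bound: performance is capped by compute or by memory traffic."""
    return min(peak_flops, arithmetic_intensity * mem_bandwidth)


# Assumed cost model for a five-point stencil (illustrative only).
FLOPS_PER_POINT = 9   # 5 multiplies + 4 adds
BYTES_PER_POINT = 8   # one float32 read + one float32 write, perfect reuse

# Hypothetical machine, not a real device.
PEAK_FLOPS = 10e12      # 10 TFLOP/s
MEM_BANDWIDTH = 1e12    # 1 TB/s

intensity = FLOPS_PER_POINT / BYTES_PER_POINT   # FLOP per byte
ridge_point = PEAK_FLOPS / MEM_BANDWIDTH        # intensity where the two limits meet
bound = attainable_flops(PEAK_FLOPS, MEM_BANDWIDTH, intensity)
is_memory_bound = intensity < ridge_point

print(f"arithmetic intensity: {intensity} FLOP/byte")
print(f"ridge point:          {ridge_point} FLOP/byte")
print(f"attainable:           {bound / 1e12} TFLOP/s")
print(f"memory-bound:         {is_memory_bound}")
```

Under these assumptions the kernel sits far to the left of the ridge point, so raising memory bandwidth (for example by keeping the working set in on-chip SRAM) lifts the attainable performance directly, while adding compute throughput does not.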

Problem

Research questions and friction points this paper is trying to address.

Stencil computations

Memory Wall

High-Performance Computing

scientific computing

memory-bound

Innovation

Methods, ideas, or system contributions that make the work stand out.

Stencil computations

Wafer-Scale Engine

Memory Wall