🤖 AI Summary
This work investigates how efficiently the two-dimensional five-point stencil, a canonical kernel in traditional high-performance computing, can be supported on the Tenstorrent Wormhole AI accelerator. The authors propose two approaches: Axpy, an element-wise formulation over submatrices, and MatMul, a reformulation of the stencil as a matrix multiplication, marking the first implementation and systematic evaluation of scientific stencil kernels on Wormhole. Through fine-grained profiling and theoretical modeling, they identify PCIe data transfers, device initialization, and host-side preprocessing as the primary bottlenecks, and outline directions for hardware-software co-optimization. Experimental results show that isolated kernel performance is competitive with a CPU, and that the Axpy variant consumes less energy than the CPU baseline for large inputs. Although end-to-end execution remains approximately three times slower than the CPU baseline, the study provides empirical insights and design principles for extending AI accelerators into the HPC domain.
📝 Abstract
As investment in AI-focused accelerators grows and their deployment in supercomputing facilities expands, understanding whether these architectures can efficiently support traditional scientific kernels is critical for the future of High-Performance Computing. We investigate the mapping of 2D 5-point stencil computations onto the Tenstorrent Wormhole, a RISC-V AI dataflow accelerator. We develop two heterogeneous implementations: Axpy, which decomposes the stencil into element-wise submatrix operations, and MatMul, which reformulates it as a matrix multiplication. While the CPU baseline remains 3x faster end-to-end, profiling reveals that the isolated Wormhole kernel is competitive with CPU execution, with the gap driven by PCIe transfers, device initialization, and host-side preprocessing. Despite slower runtime, Axpy achieves lower energy consumption than the CPU baseline for large inputs. Through detailed profiling and theoretical analysis, we identify key architectural and software limitations of the current platform and outline concrete hardware and software directions that could make AI accelerators competitive for HPC workloads.
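To make the two formulations concrete, the following is a minimal host-side NumPy sketch, not the authors' Wormhole implementation. It assumes a Jacobi-style update with hypothetical weights (0 for the centre, 0.25 for each neighbour) and computes interior points only. The function names, the weights, and the stacked-neighbour layout used for the MatMul variant are illustrative assumptions; the abstract does not specify how the paper arranges its matrices.

```python
import numpy as np

# Hypothetical weights in the order (centre, north, south, west, east).
# These are illustrative values, not taken from the paper.
WEIGHTS = np.array([0.0, 0.25, 0.25, 0.25, 0.25])


def shifted_views(u):
    """Return the five interior-sized submatrices of u: centre, N, S, W, E."""
    return [
        u[1:-1, 1:-1],  # centre
        u[:-2, 1:-1],   # north neighbour
        u[2:, 1:-1],    # south neighbour
        u[1:-1, :-2],   # west neighbour
        u[1:-1, 2:],    # east neighbour
    ]


def stencil_axpy(u, w=WEIGHTS):
    """Axpy-style: accumulate out += a * x over the five shifted submatrices."""
    out = np.zeros((u.shape[0] - 2, u.shape[1] - 2), dtype=np.result_type(u, w))
    for a, x in zip(w, shifted_views(u)):
        out += a * x  # one element-wise axpy per neighbour
    return out


def stencil_matmul(u, w=WEIGHTS):
    """MatMul-style: stack flattened neighbours into a (5, K) matrix and
    multiply by a (1, 5) weight row. The layout is an assumption."""
    views = shifted_views(u)
    cols = np.stack([v.ravel() for v in views], axis=0)  # shape (5, K)
    out = w.reshape(1, 5) @ cols                         # shape (1, K)
    return out.reshape(views[0].shape)


def stencil_reference(u, w=WEIGHTS):
    """Plain nested-loop reference used only to check the two variants."""
    h, wd = u.shape
    out = np.zeros((h - 2, wd - 2), dtype=np.result_type(u, w))
    for i in range(1, h - 1):
        for j in range(1, wd - 1):
            out[i - 1, j - 1] = (
                w[0] * u[i, j]
                + w[1] * u[i - 1, j]
                + w[2] * u[i + 1, j]
                + w[3] * u[i, j - 1]
                + w[4] * u[i, j + 1]
            )
    return out


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    grid = rng.random((8, 10))
    print(np.allclose(stencil_axpy(grid), stencil_reference(grid)))
    print(np.allclose(stencil_matmul(grid), stencil_reference(grid)))
```

Both variants perform the same arithmetic. The difference is which hardware primitive they target: the Axpy form maps to element-wise units, while the MatMul form trades extra data movement (five copies of the interior) for a single matrix product.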