AI Summary
This work addresses three limitations of existing cycle-accurate simulators for TPU-like systolic arrays: limited accuracy, weak validation against real hardware, and poor integration with modern ML compiler stacks. To overcome them, we extend SCALE-Sim v3 into SCALE-Sim TPU, a simulation platform for TPU-style accelerators. The approach validates the systolic GEMM model against measurements on Google TPU v4, introduces lightweight learned latency models for non-systolic elementwise operations that use only tensor size and shape, and integrates a StableHLO frontend so that workloads from frameworks such as JAX and PyTorch can be simulated end to end. Simulated GEMM cycle counts show a strong linear correlation with measured TPU v4 latency, and the elementwise latency models reach median relative errors below 3%, which improves end-to-end latency estimation.
Abstract
Cycle-accurate simulators are widely used to study systolic accelerators, yet their accuracy and usability are often limited by weak validation against real hardware and poor integration with modern ML compiler stacks. This paper presents SCALE-Sim TPU, a validated and extended version of SCALE-Sim v3 for TPU-style accelerators. Specifically, we make three contributions: (1) We validate SCALE-Sim's systolic GEMM model against measurements on Google TPU v4 and show that simulated cycle counts exhibit a strong linear correlation with hardware latency, enabling a simple cycle-to-latency mapping. (2) We introduce lightweight learned latency models for non-systolic elementwise operations, achieving median relative errors below 3 percent using only tensor size and shape, substantially improving end-to-end latency estimation. (3) We integrate a StableHLO-based frontend that allows workloads from modern ML frameworks such as JAX and PyTorch to be simulated directly via a unified compiler IR. Together, these contributions improve the fidelity, coverage, and practicality of cycle-accurate simulation for whole-model performance analysis on TPUs.
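As a rough illustration of the cycle-to-latency mapping in contribution (1), the sketch below fits a straight line from simulated cycle counts to measured latency by ordinary least squares and reports the median relative error of the fit. It assumes the mapping has the form latency = slope * cycles + intercept; the function names `fit_cycle_to_latency` and `median_relative_error` are hypothetical, and the sample values are synthetic placeholders, not measurements from the paper or from any TPU.

```python
from statistics import median


def fit_cycle_to_latency(cycles, latencies_us):
    """Ordinary least-squares fit of latency_us ~ slope * cycles + intercept."""
    n = len(cycles)
    if n < 2 or n != len(latencies_us):
        raise ValueError("need at least two paired samples")
    mean_c = sum(cycles) / n
    mean_t = sum(latencies_us) / n
    var_c = sum((c - mean_c) ** 2 for c in cycles)
    if var_c == 0:
        raise ValueError("cycle counts must not all be equal")
    cov = sum((c - mean_c) * (t - mean_t) for c, t in zip(cycles, latencies_us))
    slope = cov / var_c
    intercept = mean_t - slope * mean_c
    return slope, intercept


def median_relative_error(predicted, measured):
    """Median of |predicted - measured| / measured over paired samples."""
    return median(abs(p - m) / m for p, m in zip(predicted, measured))


# Synthetic placeholder data (assumption): NOT results from the paper.
sim_cycles = [1000, 2000, 4000, 8000]   # simulated GEMM cycle counts
measured_us = [3.0, 5.0, 9.0, 17.0]     # made-up hardware latencies in microseconds

slope, intercept = fit_cycle_to_latency(sim_cycles, measured_us)
predicted_us = [slope * c + intercept for c in sim_cycles]
err = median_relative_error(predicted_us, measured_us)
print(f"slope={slope:.6f} us/cycle, intercept={intercept:.3f} us, median rel. err={err:.2%}")
```

In this reading, the slope plays the role of an effective time per simulated cycle and the intercept absorbs fixed overhead that the simulator does not model. The fit a real validation would use, and the data behind it, may differ from this sketch.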