SCALE-Sim TPU: Validating and Extending SCALE-Sim for TPUs

📅 2026-03-23
🤖 AI Summary
This work addresses three limitations of existing cycle-accurate simulators for TPU-like systolic arrays: insufficient accuracy, weak validation against real hardware, and poor integration with modern ML compiler stacks. To overcome them, the authors extend SCALE-Sim v3 into a simulation platform tailored for TPUs. The approach validates the systolic GEMM model against measurements on real TPU v4 hardware, introduces lightweight learned latency models for non-systolic elementwise operations that use only tensor size and shape, and integrates a StableHLO frontend so that workloads from mainstream frameworks such as JAX and PyTorch can be simulated end to end. Experimental results show a strong linear correlation between simulated cycle counts and measured TPU v4 latency, and a median relative error below 3% for the non-systolic operations, which improves both simulation accuracy and practical utility.
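The accuracy figure quoted above is a median relative error. As a minimal sketch of how such a metric is commonly computed (the function name and the sample latencies are illustrative placeholders, not values or code from the paper):

```python
from statistics import median


def median_relative_error(predicted, measured):
    """Median of |predicted - measured| / |measured| over paired samples."""
    if len(predicted) != len(measured) or len(measured) == 0:
        raise ValueError("need two non-empty sequences of equal length")
    return median(abs(p - m) / abs(m) for p, m in zip(predicted, measured))


# Placeholder latencies in microseconds, invented purely for illustration.
measured_us = [10.0, 20.0, 40.0, 80.0, 100.0]
predicted_us = [10.1, 19.0, 40.0, 84.0, 102.0]

mre = median_relative_error(predicted_us, measured_us)
print(f"median relative error: {mre:.2%}")
```

The median is less sensitive than the mean to a few badly predicted operations, which is one plausible reason to report it for a latency model.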

📝 Abstract
Cycle-accurate simulators are widely used to study systolic accelerators, yet their accuracy and usability are often limited by weak validation against real hardware and poor integration with modern ML compiler stacks. This paper presents SCALE-Sim TPU, a validated and extended version of SCALE-Sim v3 for TPU-style accelerators. Specifically, we make three contributions: (1) We validate SCALE-Sim's systolic GEMM model against measurements on Google TPU v4 and show that simulated cycle counts exhibit a strong linear correlation with hardware latency, enabling a simple cycle-to-latency mapping. (2) We introduce lightweight learned latency models for non-systolic elementwise operations, achieving median relative errors below 3 percent using only tensor size and shape, substantially improving end-to-end latency estimation. (3) We integrate a StableHLO-based frontend that allows workloads from modern ML frameworks such as JAX and PyTorch to be simulated directly via a unified compiler IR. Together, these contributions improve the fidelity, coverage, and practicality of cycle-accurate simulation for whole-model performance analysis on TPUs.
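Contribution (1) relies on a simple cycle-to-latency mapping. One way to picture such a mapping is an ordinary least-squares line from simulated cycles to measured latency. The sketch below assumes that form; the function names and the calibration points are hypothetical placeholders, not the paper's code or measurements.

```python
def fit_cycles_to_latency(cycles, latency_us):
    """Least-squares fit of latency_us ~ slope * cycles + intercept."""
    n = len(cycles)
    if n != len(latency_us) or n < 2:
        raise ValueError("need at least two paired samples")
    mean_c = sum(cycles) / n
    mean_t = sum(latency_us) / n
    sxx = sum((c - mean_c) ** 2 for c in cycles)
    sxy = sum((c - mean_c) * (t - mean_t) for c, t in zip(cycles, latency_us))
    if sxx == 0:
        raise ValueError("cycle counts must not all be identical")
    slope = sxy / sxx
    intercept = mean_t - slope * mean_c
    return slope, intercept


def predict_latency_us(cycles, slope, intercept):
    """Map a simulated cycle count to an estimated latency in microseconds."""
    return slope * cycles + intercept


# Placeholder calibration points (simulated cycles, measured microseconds),
# invented for illustration; real points would come from hardware runs.
sim_cycles = [1000, 2000, 4000, 8000]
measured_us = [3.0, 4.0, 6.0, 10.0]

slope, intercept = fit_cycles_to_latency(sim_cycles, measured_us)
estimate_us = predict_latency_us(16000, slope, intercept)
print(f"slope={slope:.6f} us/cycle, intercept={intercept:.3f} us, "
      f"estimate for 16000 cycles={estimate_us:.3f} us")
```

In this reading, the slope plays the role of an effective time per simulated cycle and the intercept absorbs fixed overhead that the simulator does not model.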

Problem

Research questions and friction points this paper is trying to address.

cycle-accurate simulation
systolic accelerators
TPU
validation
ML compiler integration

Innovation

Methods, ideas, or system contributions that make the work stand out.

cycle-accurate simulation
TPU validation
learned latency models
StableHLO frontend
systolic array