🤖 AI Summary
Conventional TPU architectures suffer from low energy efficiency and high power consumption during inference of generative models (e.g., LLMs and diffusion Transformers). Method: This work proposes a TPU architecture that integrates digital Compute-in-Memory (CIM) units directly into the matrix multiplication unit (MXU), replacing the traditional systolic array to alleviate the von Neumann bottleneck. We further co-design a customized microarchitectural model and a hardware-aware mapping methodology tailored to generative models, enabling end-to-end optimization in simulation. Contribution/Results: Compared to the TPUv4i baseline, our architecture achieves up to 44.2% and 33.8% higher inference throughput for LLMs and diffusion Transformers, respectively, while reducing MXU energy consumption by 27.3×, yielding substantial improvements in both energy efficiency and computational throughput.
📝 Abstract
With the rapid advance of generative models, efficiently deploying them on specialized hardware has become critical. Tensor Processing Units (TPUs) are designed to accelerate AI workloads, but their high power consumption necessitates innovations for improving efficiency. Compute-in-memory (CIM) has emerged as a promising paradigm with superior area and energy efficiency. In this work, we present a TPU architecture that integrates digital CIM to replace the conventional digital systolic arrays in matrix multiply units (MXUs). We first establish a CIM-based TPU architecture model and simulator to evaluate the benefits of CIM for diverse generative model inference. Building upon the observed design insights, we further explore various CIM-based TPU architectural design choices. Compared to the baseline TPUv4i architecture, different design choices achieve up to 44.2% and 33.8% performance improvement for large language model and diffusion transformer inference, respectively, and up to a 27.3x reduction in MXU energy consumption.
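As a back-of-envelope illustration (not code from the paper), the headline metrics above can be related with a short sketch. The only figures taken from the abstract are the 44.2% throughput improvement and the 27.3x MXU energy reduction; the baseline throughput and energy values are arbitrary placeholders.

```python
# Hedged sketch: how relative speedup and energy-reduction figures are
# typically computed when comparing an accelerator against a baseline.

def relative_speedup(new_throughput: float, base_throughput: float) -> float:
    """Fractional throughput improvement over the baseline."""
    return (new_throughput - base_throughput) / base_throughput

def energy_reduction_factor(base_energy: float, new_energy: float) -> float:
    """How many times less energy the new design consumes."""
    return base_energy / new_energy

# Hypothetical baseline TPUv4i numbers (arbitrary units, for illustration only).
base_tokens_per_s = 1000.0
base_mxu_energy_j = 27.3

# Values implied by the abstract's reported improvements.
cim_tokens_per_s = base_tokens_per_s * 1.442   # +44.2% LLM inference throughput
cim_mxu_energy_j = base_mxu_energy_j / 27.3    # 27.3x MXU energy reduction

print(f"LLM speedup: {relative_speedup(cim_tokens_per_s, base_tokens_per_s):.1%}")
print(f"MXU energy reduction: {energy_reduction_factor(base_mxu_energy_j, cim_mxu_energy_j):.1f}x")
```

This only restates the reported ratios; the actual numbers in the paper come from the authors' simulator, not from arithmetic like this.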