Leveraging Compute-in-Memory for Efficient Generative Model Inference in TPUs

πŸ“… 2025-03-01
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ€– AI Summary
Conventional TPU architectures exhibit low energy efficiency and high power consumption during inference of generative models (e.g., LLMs and diffusion Transformers). Method: This work proposes a novel TPU architecture integrating digital Compute-in-Memory (CIM) units directly into the matrix multiplication unit (MXU), replacing the traditional systolic array to alleviate the von Neumann bottleneck. We further co-design a customized microarchitectural model and a hardware-aware mapping methodology tailored to generative models, enabling end-to-end optimization in simulation. Contribution/Results: Compared to the TPUv4i baseline, our architecture achieves up to 44.2% and 33.8% higher inference throughput for LLMs and diffusion Transformers, respectively, while reducing MXU energy consumption by up to 27.3Γ—, a substantial gain in both energy efficiency and computational throughput.
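
To make the headline numbers concrete, here is a minimal back-of-envelope sketch (not the paper's simulator) of how one might compare the two MXU styles on a single GEMM. The helper names (`gemm_macs`, `mxu_energy_joules`) and the per-MAC energies are illustrative assumptions; only the 27.3Γ— ratio is taken from the paper's results.

```python
# Back-of-envelope MXU energy comparison for one GEMM.
# All per-MAC energies are placeholder values, not measurements from the paper;
# only the 27.3x ratio mirrors the paper's reported MXU energy reduction.

def gemm_macs(m: int, n: int, k: int) -> int:
    """Total multiply-accumulates for an (m x k) @ (k x n) GEMM."""
    return m * n * k

def mxu_energy_joules(macs: int, pj_per_mac: float) -> float:
    """GEMM energy given a per-MAC energy in picojoules."""
    return macs * pj_per_mac * 1e-12

SYSTOLIC_PJ_PER_MAC = 1.0                    # hypothetical baseline value
CIM_PJ_PER_MAC = SYSTOLIC_PJ_PER_MAC / 27.3  # paper's reported energy ratio

macs = gemm_macs(4096, 4096, 4096)           # one large LLM-style matmul
print(f"systolic-array MXU: {mxu_energy_joules(macs, SYSTOLIC_PJ_PER_MAC):.4f} J")
print(f"digital-CIM MXU:    {mxu_energy_joules(macs, CIM_PJ_PER_MAC):.4f} J")
```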

πŸ“ Abstract
With the rapid advent of generative models, efficiently deploying these models on specialized hardware has become critical. Tensor Processing Units (TPUs) are designed to accelerate AI workloads, but their high power consumption necessitates innovations for improving efficiency. Compute-in-memory (CIM) has emerged as a promising paradigm with superior area and energy efficiency. In this work, we present a TPU architecture that integrates digital CIM to replace conventional digital systolic arrays in matrix multiply units (MXUs). We first establish a CIM-based TPU architecture model and simulator to evaluate the benefits of CIM for diverse generative model inference. Building upon the observed design insights, we further explore various CIM-based TPU architectural design choices. Up to 44.2% and 33.8% performance improvement for large language model and diffusion transformer inference, and 27.3x reduction in MXU energy consumption can be achieved with different design choices, compared to the baseline TPUv4i architecture.
Problem

Research questions and friction points this paper is trying to address.

- Efficient deployment of generative models on TPUs
- Reducing power consumption in TPU architectures
- Improving performance and energy efficiency using Compute-in-Memory
Innovation

Methods, ideas, or system contributions that make the work stand out.

- Integrates digital Compute-in-Memory units into the TPU's MXU
- Replaces the conventional digital systolic array (a toy contrast is sketched below)
- Achieves up to 44.2%/33.8% higher LLM/diffusion-transformer inference throughput and a 27.3Γ— MXU energy reduction versus TPUv4i
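
The second point is the core architectural change: in a weight-stationary digital CIM macro, weights stay resident in the memory array and activations are streamed in bit-serially with shift-and-add accumulation, whereas a systolic array streams both operands through a grid of MACs. The sketch below is a functional toy model of that bit-serial scheme; the tile shapes and 8-bit precision are assumptions, not the paper's parameters.

```python
import numpy as np

def cim_mvp_bit_serial(weights: np.ndarray, x: np.ndarray, bits: int = 8) -> np.ndarray:
    """Matrix-vector product as a weight-stationary digital CIM macro computes it.

    weights: (rows, cols) signed ints held resident in the memory array.
    x: (rows,) unsigned activations, streamed into the array one bit at a time.
    """
    acc = np.zeros(weights.shape[1], dtype=np.int64)
    for b in range(bits):
        x_bit = (x >> b) & 1                              # bit plane b of every activation
        partial = (x_bit[:, None] * weights).sum(axis=0)  # in-array bitwise multiply-accumulate
        acc += partial.astype(np.int64) << b              # shift-add to weight the bit plane
    return acc

rng = np.random.default_rng(0)
W = rng.integers(-8, 8, size=(16, 4))  # toy weight tile stored in the CIM macro
x = rng.integers(0, 256, size=16)      # 8-bit unsigned activations
assert np.array_equal(cim_mvp_bit_serial(W, x), x @ W)  # matches a plain matmul
```

Because the multiplies happen where the weights are stored, per-MAC operand movement drops sharply, which is the general source of CIM's energy advantage over shuttling weights through a systolic array.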
Authors
Zhantong Zhu
School of Integrated Circuits, Peking University, Beijing, China; School of EECS, Peking University, Beijing, China
Hongou Li
School of Integrated Circuits, Peking University, Beijing, China; School of EECS, Peking University, Beijing, China
Wenjie Ren
School of Integrated Circuits, Peking University, Beijing, China
Meng Wu
Department of Electrical Engineering, Stanford University (Medical Imaging · Machine Learning · Computer Vision)
Le Ye
School of Integrated Circuits, Peking University, Beijing, China
Ru Huang
School of Integrated Circuits, Peking University, Beijing, China
Tianyu Jia
Assistant Professor, Peking University (VLSI Design · Computer Architecture)