🤖 AI Summary
Conventional TPU architectures suffer from low energy efficiency and high power consumption during inference of generative models (e.g., LLMs and diffusion Transformers). Method: This work proposes a TPU architecture that integrates digital Compute-in-Memory (CIM) units directly into the matrix multiplication unit (MXU), replacing the traditional systolic array to alleviate the von Neumann bottleneck. We further co-design a customized microarchitectural model and a hardware-aware mapping methodology tailored to generative models, enabling end-to-end optimization in simulation. Contribution/Results: Compared to the TPUv4i baseline, our architecture achieves up to 44.2% and 33.8% higher inference throughput for LLMs and diffusion Transformers, respectively, while reducing MXU energy consumption by 27.3×, yielding substantial improvements in both energy efficiency and computational throughput.
📝 Abstract
With the rapid advance of generative models, efficiently deploying them on specialized hardware has become critical. Tensor Processing Units (TPUs) are designed to accelerate AI workloads, but their high power consumption necessitates innovations for improving efficiency. Compute-in-memory (CIM) has emerged as a promising paradigm with superior area and energy efficiency. In this work, we present a TPU architecture that integrates digital CIM to replace the conventional digital systolic arrays in matrix multiply units (MXUs). We first establish a CIM-based TPU architecture model and simulator to evaluate the benefits of CIM for diverse generative model inference. Building upon the observed design insights, we further explore various CIM-based TPU architectural design choices. Compared to the baseline TPUv4i architecture, different design choices achieve up to 44.2% and 33.8% performance improvement for large language model and diffusion transformer inference, respectively, and up to a 27.3x reduction in MXU energy consumption.
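As a back-of-envelope illustration (not code from the paper), the headline metrics above can be related with a short sketch. The only figures taken from the abstract are the 44.2% throughput improvement and the 27.3x MXU energy reduction; the baseline throughput and energy values are arbitrary placeholders.

```python
# Hedged sketch: how relative speedup and energy-reduction figures are
# typically computed when comparing an accelerator against a baseline.

def relative_speedup(new_throughput: float, base_throughput: float) -> float:
    """Fractional throughput improvement over the baseline."""
    return (new_throughput - base_throughput) / base_throughput

def energy_reduction_factor(base_energy: float, new_energy: float) -> float:
    """How many times less energy the new design consumes."""
    return base_energy / new_energy

# Hypothetical baseline TPUv4i numbers (arbitrary units, for illustration only).
base_tokens_per_s = 1000.0
base_mxu_energy_j = 27.3

# Values implied by the abstract's reported improvements.
cim_tokens_per_s = base_tokens_per_s * 1.442   # +44.2% LLM inference throughput
cim_mxu_energy_j = base_mxu_energy_j / 27.3    # 27.3x MXU energy reduction

print(f"LLM speedup: {relative_speedup(cim_tokens_per_s, base_tokens_per_s):.1%}")
print(f"MXU energy reduction: {energy_reduction_factor(base_mxu_energy_j, cim_mxu_energy_j):.1f}x")
```

This only restates the reported ratios; the actual numbers in the paper come from the authors' simulator, not from arithmetic like this.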