Pimba: A Processing-in-Memory Acceleration for Post-Transformer Large Language Model Serving

📅 2025-07-14
📈 Citations: 0
Influential: 0
🤖 AI Summary
Transformer and post-transformer architectures (e.g., SSMs, linear attention, RNNs) face severe memory bandwidth bottlenecks during long-context inference. Method: the paper proposes Pimba, a unified Processing-in-Memory (PIM) acceleration architecture for efficient inference across both model families. Its core component is an array of State-update Processing Units (SPUs), each shared between two banks, that executes both attention and state-update operations using MX low-precision quantization and element-wise multiply-accumulate arithmetic; interleaved access to the two banks balances computational efficiency against hardware cost. Contribution/Results: experiments show up to 3.2× and 2.1× higher token-generation throughput than state-of-the-art GPU and GPU+PIM baselines, respectively, establishing a scalable, cost-effective hardware design for serving heterogeneous LLMs.
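
The element-wise state update at the heart of post-transformer models can be sketched as follows (a minimal illustration in the spirit of an SSM/linear-attention recurrence, not the paper's exact kernel; the operand names are hypothetical):

```python
import numpy as np

def state_update(h, a, b, x):
    """One element-wise state-update step, SSM-style:
    h_t = a * h_{t-1} + b * x_t  (all operands element-wise).
    Each output element needs only a multiply-accumulate but touches
    several state-sized memory operands, which is why batched decoding
    is bandwidth-bound rather than compute-bound.
    """
    return a * h + b * x

# Toy decode loop for a single head (sizes are illustrative).
rng = np.random.default_rng(0)
d = 8                          # state dimension (toy size)
h = np.zeros(d)
a = rng.uniform(0.9, 1.0, d)   # decay gates
for _ in range(4):             # four decode steps
    x = rng.standard_normal(d)
    b = rng.standard_normal(d)
    h = state_update(h, a, b, x)
```

Because the update is purely element-wise, it maps naturally onto the element-wise multipliers and adders inside the SPE described above.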

📝 Abstract
Transformers are the driving force behind today's Large Language Models (LLMs), serving as the foundation for their performance and versatility. Yet, their compute and memory costs grow with sequence length, posing scalability challenges for long-context inferencing. In response, the algorithm community is exploring alternative architectures, such as state space models (SSMs), linear attention, and recurrent neural networks (RNNs), which we refer to as post-transformers. This shift presents a key challenge: building a serving system that efficiently supports both transformer and post-transformer LLMs within a unified framework. To address this challenge, we analyze the performance characteristics of transformer and post-transformer LLMs. Despite their algorithmic differences, both are fundamentally limited by memory bandwidth under batched inference due to attention in transformers and state updates in post-transformers. Further analyses suggest two additional insights: (1) state update operations, unlike attention, incur high hardware cost, making per-bank PIM acceleration inefficient, and (2) different low-precision arithmetic methods offer varying accuracy-area tradeoffs, while we identify Microsoft's MX as the Pareto-optimal choice. Building on these insights, we design Pimba as an array of State-update Processing Units (SPUs), each shared between two banks to enable interleaved access to PIM. Each SPU includes a State-update Processing Engine (SPE) that comprises element-wise multipliers and adders using MX-based quantized arithmetic, enabling efficient execution of state update and attention operations. Our evaluation shows that, compared to LLM-optimized GPU and GPU+PIM systems, Pimba achieves up to 3.2x and 2.1x higher token generation throughput, respectively.
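
The abstract's choice of MX arithmetic can be illustrated with a simplified sketch. This follows the spirit of the OCP Microscaling formats (an MXINT8-like variant with one shared power-of-two scale per 32-element block) and is not the paper's exact implementation:

```python
import numpy as np

BLOCK = 32  # MX shares one scale per 32-element block

def mx_quantize(v):
    """Simplified MXINT8-style quantization sketch (not the exact
    OCP MX spec): each 32-element block shares one power-of-two
    scale (as in MX's E8M0 scale format) and stores INT8 elements."""
    v = v.reshape(-1, BLOCK)
    amax = np.abs(v).max(axis=1, keepdims=True)
    # Power-of-two scale so the largest magnitude maps within int8 range.
    safe = np.maximum(amax, 1e-30)
    scale = np.where(amax > 0, 2.0 ** np.ceil(np.log2(safe / 127.0)), 1.0)
    q = np.clip(np.round(v / scale), -127, 127).astype(np.int8)
    return q, scale

def mx_dequantize(q, scale):
    return (q.astype(np.float32) * scale).reshape(-1)

x = np.random.default_rng(1).standard_normal(64).astype(np.float32)
q, s = mx_quantize(x)
x_hat = mx_dequantize(q, s)
```

The shared power-of-two scale keeps the per-element multipliers narrow, which is one reason block-scaled formats offer a favorable accuracy-area tradeoff.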
Problem

Research questions and friction points this paper is trying to address.

Efficiently serving transformer and post-transformer LLMs in a unified framework
Overcoming memory bandwidth limitations in batched LLM inference
Optimizing hardware cost and accuracy for state update operations
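
The bandwidth bottleneck in the second point can be made concrete with a back-of-the-envelope arithmetic-intensity estimate (the numbers here are illustrative assumptions, not figures from the paper):

```python
def arithmetic_intensity(bytes_per_elem=2):
    """Roofline-style estimate (illustrative, not from the paper).
    An element-wise state update h = a*h + b*x does 3 FLOPs per
    element (two multiplies, one add) while moving five operand
    elements (read h, a, b, x; write h)."""
    flops = 3
    bytes_moved = 5 * bytes_per_elem
    return flops / bytes_moved

ai = arithmetic_intensity()  # 0.3 FLOP/byte with FP16 operands
```

At roughly 0.3 FLOP/byte, such kernels sit far below the tens of FLOP/byte a modern GPU needs to be compute-bound, which is what motivates moving the computation into memory.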
Innovation

Methods, ideas, or system contributions that make the work stand out.

Processing-in-Memory acceleration for post-transformers
State-update Processing Units shared between banks
MX-based quantized arithmetic for efficient operations
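
The bank-sharing idea in the second bullet can be sketched as a toy schedule (purely illustrative; the timing and names are assumptions, not the paper's design):

```python
def interleaved_schedule(n_cycles):
    """Toy schedule for one SPU shared between two DRAM banks:
    the SPU consumes an operand from bank 0 on even cycles and
    bank 1 on odd cycles, so the shared unit stays busy every
    cycle while each bank is accessed only every other cycle."""
    return ["bank0" if t % 2 == 0 else "bank1" for t in range(n_cycles)]

sched = interleaved_schedule(4)  # ['bank0', 'bank1', 'bank0', 'bank1']
```

Sharing one SPU between two banks roughly halves the per-bank logic cost, while interleaving keeps the shared unit utilized, which matches the paper's observation that per-bank PIM is inefficient for state updates.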