PAPI: Exploiting Dynamic Parallelism in Large Language Model Decoding with a Processing-In-Memory-Enabled Computing System

📅 2025-02-21
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the inefficiency of static processing-in-memory (PIM) architectures in large language model (LLM) decoding—caused by dynamically varying compute and memory access patterns—this paper proposes a runtime-adaptive heterogeneous PIM acceleration architecture. The method introduces: (1) an online kernel feature identification and dynamic hardware mapping mechanism; (2) hybrid-capability PIM units co-designed with general-purpose compute units to overcome the limitations of static mapping and homogeneous PIM designs; and (3) a heterogeneous system architecture supporting dynamic kernel characterization and real-time scheduling. Evaluated on three mainstream LLMs, the architecture achieves 1.8× and 11.1× end-to-end decoding speedups over state-of-the-art heterogeneous accelerators and pure PIM accelerators, respectively.

📝 Abstract
Large language models (LLMs) are widely used for natural language understanding and text generation. An LLM model relies on a time-consuming step called LLM decoding to generate output tokens. Several prior works focus on improving the performance of LLM decoding using parallelism techniques, such as batching and speculative decoding. State-of-the-art LLM decoding has both compute-bound and memory-bound kernels. Some prior works statically identify and map these different kernels to a heterogeneous architecture consisting of both processing-in-memory (PIM) units and computation-centric accelerators. We observe that characteristics of LLM decoding kernels (e.g., whether or not a kernel is memory-bound) can change dynamically due to parameter changes to meet user and/or system demands, making (1) static kernel mapping to PIM units and computation-centric accelerators suboptimal, and (2) one-size-fits-all approach of designing PIM units inefficient due to a large degree of heterogeneity even in memory-bound kernels. In this paper, we aim to accelerate LLM decoding while considering the dynamically changing characteristics of the kernels involved. We propose PAPI (PArallel Decoding with PIM), a PIM-enabled heterogeneous architecture that exploits dynamic scheduling of compute-bound or memory-bound kernels to suitable hardware units. PAPI has two key mechanisms: (1) online kernel characterization to dynamically schedule kernels to the most suitable hardware units at runtime and (2) a PIM-enabled heterogeneous computing system that harmoniously orchestrates both computation-centric processing units and hybrid PIM units with different computing capabilities. Our experimental results on three broadly-used LLMs show that PAPI achieves 1.8× and 11.1× speedups over a state-of-the-art heterogeneous LLM accelerator and a state-of-the-art PIM-only LLM accelerator, respectively.
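The core idea in the abstract—characterize each decoding kernel online and route it to the hardware that suits it—can be illustrated with a roofline-style dispatch rule. The sketch below is not PAPI's actual scheduler; the `Kernel` fields, the ridge-point threshold, and the kernel numbers are hypothetical values chosen for illustration, assuming arithmetic intensity (FLOPs per byte of DRAM traffic) is the characterization signal.

```python
# Hypothetical sketch of online kernel characterization + dynamic dispatch.
# A kernel with low arithmetic intensity is memory-bound and goes to PIM;
# a high-intensity kernel is compute-bound and goes to the accelerator.
from dataclasses import dataclass

@dataclass
class Kernel:
    name: str
    flops: float        # floating-point operations per invocation
    bytes_moved: float  # DRAM traffic per invocation

def arithmetic_intensity(k: Kernel) -> float:
    return k.flops / k.bytes_moved

def dispatch(k: Kernel, ridge_point: float = 10.0) -> str:
    # ridge_point is a made-up machine balance threshold (FLOPs/byte).
    return "PIM" if arithmetic_intensity(k) < ridge_point else "ACCEL"

# With batch size 1, attention is a GEMV and memory-bound; a large-batch
# FFN GEMM is compute-bound. Because batching and other parameters change
# at runtime, the same kernel class can cross the threshold — which is
# why a static mapping is suboptimal.
attn_gemv = Kernel("attention_gemv", flops=2e9, bytes_moved=1e9)    # AI = 2
ffn_gemm  = Kernel("ffn_gemm_b32", flops=6.4e10, bytes_moved=2e9)   # AI = 32
print(dispatch(attn_gemv))  # PIM
print(dispatch(ffn_gemm))   # ACCEL
```

The decision here is re-evaluated per invocation, mirroring the paper's point that scheduling must happen at runtime rather than at design time.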
Problem

Research questions and friction points this paper is trying to address.

Dynamic scheduling of LLM decoding kernels
Optimizing PIM and accelerator utilization
Improving speed in heterogeneous computing systems
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dynamic kernel scheduling
PIM-enabled heterogeneous architecture
Online kernel characterization