🤖 AI Summary
This work addresses severe GPU memory fragmentation and low cache efficiency in heterogeneous large language model (LLM) inference, caused by disparities in embedding dimensions, attention mechanisms, and memory access patterns. To tackle this, we propose a two-level memory management framework tailored for LLM inference. Its core innovations include: (i) a unified memory block granularity aligned to the least common multiple (LCM) of embedding sizes, enabling efficient allocation across diverse models; (ii) a programmable memory allocator supporting layer-customized caching and eviction policies; and (iii) a layer-aware caching API deeply integrated into the vLLM inference engine. Experimental evaluation demonstrates up to 79.6% improvement in GPU memory utilization and up to 4.92× higher service throughput (1.80× on average), significantly outperforming state-of-the-art approaches.
📝 Abstract
Large language models (LLMs) are widely used but expensive to run, especially as inference workloads grow. To lower costs, it is crucial to maximize the request batch size by managing GPU memory efficiently. While PagedAttention has recently been proposed to improve the efficiency of memory management, we find that the growing heterogeneity in the embedding dimensions, attention mechanisms, and access patterns of modern LLM architectures introduces new challenges for memory allocation. In this paper, we present Jenga, a novel memory allocation framework for heterogeneous embeddings in LLMs. Jenga tackles two key challenges: (1) minimizing memory fragmentation when managing embeddings of different sizes, and (2) enabling flexible caching and eviction policies tailored to the specific token-dependency patterns of various layers. Jenga employs a two-level memory allocator, leveraging the least common multiple (LCM) of embedding sizes to optimize memory usage and providing APIs to express layer-specific caching logic to enhance memory reuse. We implement Jenga on vLLM, a state-of-the-art LLM inference engine, and evaluate it with diverse LLMs, datasets, and GPU configurations. Evaluations show that Jenga improves GPU memory utilization by up to 79.6%, and increases serving throughput by up to 4.92× (1.80× on average).
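To make the LCM idea concrete, here is a minimal sketch (not Jenga's actual implementation; the variable names and byte sizes are illustrative assumptions): if a unified page size is chosen as the LCM of the per-layer embedding sizes, then every layer's embeddings tile a page exactly, with no internal fragmentation.

```python
from math import lcm

# Hypothetical per-layer embedding (KV-entry) sizes in bytes for a
# heterogeneous model, e.g. full-attention vs. sliding-window vs. other layers.
embedding_sizes = [4096, 1024, 6144]

# LCM-based unified granularity: one page size that every layer's
# embedding size divides evenly.
page_size = lcm(*embedding_sizes)
print(page_size)  # 12288

# Each layer sees the same page as a whole number of its own slots,
# so no page has leftover space regardless of which layer uses it.
slots_per_page = {size: page_size // size for size in embedding_sizes}
print(slots_per_page)  # {4096: 3, 1024: 12, 6144: 2}
```

In practice a real allocator would also cap the page size and handle sizes whose LCM is impractically large, but the sketch shows why a shared LCM-aligned granularity lets one pool serve layers with different embedding sizes.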