Jenga: Effective Memory Management for Serving LLM with Heterogeneity

📅 2025-03-24
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses severe GPU memory fragmentation and low cache efficiency in heterogeneous large language model (LLM) inference, caused by disparities in embedding dimensions, attention mechanisms, and memory access patterns. To tackle this, the authors propose a two-level memory management framework tailored for LLM inference. Its core innovations include: (i) a unified memory block granularity aligned to the least common multiple (LCM) of embedding sizes, enabling efficient allocation across diverse layers; (ii) a programmable memory allocator supporting layer-customized caching and eviction policies; and (iii) a layer-aware caching API integrated into the vLLM inference engine. Experimental evaluation demonstrates up to 79.6% improvement in GPU memory utilization and up to 4.92× higher serving throughput (1.80× on average), significantly outperforming state-of-the-art approaches.
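The LCM-based block granularity described above can be sketched in a few lines: choosing a block size that every per-layer embedding size divides evenly means one allocator granularity serves all layers without internal fragmentation. This is a minimal illustration of the idea, not Jenga's actual implementation; the example sizes are assumptions.

```python
from math import gcd
from functools import reduce

def lcm(a: int, b: int) -> int:
    return a * b // gcd(a, b)

def unified_block_bytes(embedding_sizes: list[int]) -> int:
    """Smallest block size that every per-layer embedding size divides,
    so heterogeneous layers can share one allocation granularity."""
    return reduce(lcm, embedding_sizes)

# Hypothetical example: a 576-byte full-attention KV entry alongside
# a 512-byte sliding-window entry.
print(unified_block_bytes([576, 512]))  # → 4608
```

Because the block size is a multiple of each embedding size, a block handed to any layer is filled exactly, with no wasted tail bytes inside the block.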

📝 Abstract
Large language models (LLMs) are widely used but expensive to run, especially as inference workloads grow. To lower costs, maximizing the request batch size by managing GPU memory efficiently is crucial. While PagedAttention has recently been proposed to improve the efficiency of memory management, we find that the growing heterogeneity in the embedding dimensions, attention, and access patterns of modern LLM architectures introduces new challenges for memory allocation. In this paper, we present Jenga, a novel memory allocation framework for heterogeneous embeddings in LLMs. Jenga tackles two key challenges: (1) minimizing memory fragmentation when managing embeddings of different sizes, and (2) enabling flexible caching and eviction policies tailored to the specific token-dependency patterns of various layers. Jenga employs a two-level memory allocator, leveraging the least common multiple (LCM) of embedding sizes to optimize memory usage and providing APIs to express layer-specific caching logic to enhance memory reuse. We implement Jenga on vLLM, a state-of-the-art LLM inference engine, and evaluate it with diverse LLMs, datasets, and GPU configurations. Evaluations show that Jenga improves GPU memory utilization by up to 79.6%, and increases serving throughput by up to 4.92x (1.80x on average).
Problem

Research questions and friction points this paper is trying to address.

Efficient GPU memory management for heterogeneous LLM embeddings
Minimizing memory fragmentation in variable-sized LLM embeddings
Optimizing caching policies for layer-specific token dependencies
Innovation

Methods, ideas, or system contributions that make the work stand out.

Two-level memory allocator for heterogeneous embeddings
LCM-based optimization for memory usage
Layer-specific caching logic APIs
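The "layer-specific caching logic APIs" bullet can be made concrete with a small sketch of a policy registry: each layer kind registers its own eviction ordering, reflecting its token-dependency pattern. All names here are illustrative assumptions, not vLLM's or Jenga's actual API.

```python
from typing import Callable, Dict, List

# An eviction policy orders cached block IDs, earliest-to-evict first.
EvictionPolicy = Callable[[List[int]], List[int]]

class LayerCacheManager:
    """Hypothetical registry mapping a layer kind to its eviction policy."""

    def __init__(self) -> None:
        self._policies: Dict[str, EvictionPolicy] = {}

    def register(self, layer_kind: str, policy: EvictionPolicy) -> None:
        self._policies[layer_kind] = policy

    def evict_order(self, layer_kind: str, blocks: List[int]) -> List[int]:
        # Fall back to the given (e.g. LRU) order if no custom policy exists.
        return self._policies.get(layer_kind, lambda b: b)(blocks)

mgr = LayerCacheManager()
# A sliding-window layer only attends to recent tokens, so its policy can
# evict the oldest blocks (lowest IDs) first.
mgr.register("sliding_window", lambda blocks: sorted(blocks))
print(mgr.evict_order("sliding_window", [7, 2, 5]))  # → [2, 5, 7]
```

The point of the design is that the allocator stays generic while per-layer knowledge (full attention vs. sliding window, encoder vs. decoder) lives in small, swappable policies.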