🤖 AI Summary
This work addresses severe GPU memory fragmentation and low cache efficiency in heterogeneous large language model (LLM) inference, caused by disparities in embedding dimensions, attention mechanisms, and memory access patterns. To tackle this, we propose a two-level memory management framework tailored for LLM inference. Its core innovations include: (i) a unified memory block granularity aligned to the least common multiple (LCM) of embedding sizes, enabling efficient allocation across diverse models; (ii) a programmable memory allocator supporting layer-customized caching and eviction policies; and (iii) a layer-aware caching API deeply integrated into the vLLM inference engine. Experimental evaluation demonstrates up to 79.6% improvement in GPU memory utilization and up to 4.92× higher service throughput (1.80× on average), significantly outperforming state-of-the-art approaches.
📝 Abstract
Large language models (LLMs) are widely used but expensive to run, especially as inference workloads grow. To lower costs, it is crucial to maximize the request batch size by managing GPU memory efficiently. While PagedAttention has recently been proposed to improve the efficiency of memory management, we find that the growing heterogeneity in the embedding dimensions, attention mechanisms, and access patterns of modern LLM architectures introduces new challenges for memory allocation. In this paper, we present Jenga, a novel memory allocation framework for heterogeneous embeddings in LLMs. Jenga tackles two key challenges: (1) minimizing memory fragmentation when managing embeddings of different sizes, and (2) enabling flexible caching and eviction policies tailored to the specific token-dependency patterns of various layers. Jenga employs a two-level memory allocator, leveraging the least common multiple (LCM) of embedding sizes to optimize memory usage and providing APIs to express layer-specific caching logic to enhance memory reuse. We implement Jenga on vLLM, a state-of-the-art LLM inference engine, and evaluate it with diverse LLMs, datasets, and GPU configurations. Evaluations show that Jenga improves GPU memory utilization by up to 79.6%, and increases serving throughput by up to 4.92× (1.80× on average).
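To make the LCM idea concrete, here is a minimal sketch (not Jenga's actual implementation; the variable names and byte sizes are illustrative assumptions): if a unified page size is chosen as the LCM of the per-layer embedding sizes, then every layer's embeddings tile a page exactly, with no internal fragmentation.

```python
from math import lcm

# Hypothetical per-layer embedding (KV-entry) sizes in bytes for a
# heterogeneous model, e.g. full-attention vs. sliding-window vs. other layers.
embedding_sizes = [4096, 1024, 6144]

# LCM-based unified granularity: one page size that every layer's
# embedding size divides evenly.
page_size = lcm(*embedding_sizes)
print(page_size)  # 12288

# Each layer sees the same page as a whole number of its own slots,
# so no page has leftover space regardless of which layer uses it.
slots_per_page = {size: page_size // size for size in embedding_sizes}
print(slots_per_page)  # {4096: 3, 1024: 12, 6144: 2}
```

In practice a real allocator would also cap the page size and handle sizes whose LCM is impractically large, but the sketch shows why a shared LCM-aligned granularity lets one pool serve layers with different embedding sizes.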