eLLM: Elastic Memory Management Framework for Efficient LLM Serving

📅 2025-06-18
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address GPU memory fragmentation and low utilization arising from the coexistence of static weights, dynamic activations, and KV caches in LLM serving, this paper proposes a unified elastic memory management framework. Methodologically, it introduces: (1) a novel integration of virtual tensor abstraction and memory ballooning to enable runtime GPU memory scaling; (2) page-table-enhanced KV cache virtualization for fine-grained memory reuse; and (3) an SLO-aware lightweight scheduling policy that coordinates CPU-GPU memory inflation and deflation. Experiments demonstrate 2.32x higher decoding throughput and support for 3x larger batch sizes with 128K-token inputs, mitigating the nearly 20% throughput degradation that memory fragmentation otherwise causes.
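To make the virtual-tensor idea concrete, here is a minimal, hedged sketch built on CUDA's virtual memory management driver API (cuMemAddressReserve, cuMemCreate, cuMemMap): a large virtual address range is reserved once, and physical pages are mapped into or unmapped from it at runtime. The VirtualTensor struct and vt_* helpers are illustrative names, not eLLM's actual implementation.

```cpp
// Illustrative sketch of a virtual tensor: the virtual address range is
// reserved once; physical GPU pages are mapped/unmapped behind it at runtime.
#include <cuda.h>
#include <cassert>
#include <cstdio>
#include <vector>

#define CU_CHECK(call) do { CUresult r = (call); assert(r == CUDA_SUCCESS); } while (0)

struct VirtualTensor {                       // hypothetical name
    CUdeviceptr base = 0;                    // reserved virtual range
    size_t reserved = 0, mapped = 0, gran = 0;
    std::vector<CUmemGenericAllocationHandle> pages;  // physical backing
};

static CUmemAllocationProp device_prop(int dev) {
    CUmemAllocationProp prop = {};
    prop.type = CU_MEM_ALLOCATION_TYPE_PINNED;
    prop.location.type = CU_MEM_LOCATION_TYPE_DEVICE;
    prop.location.id = dev;
    return prop;
}

static void vt_reserve(VirtualTensor& t, int dev, size_t bytes) {
    CUmemAllocationProp prop = device_prop(dev);
    CU_CHECK(cuMemGetAllocationGranularity(&t.gran, &prop,
                                           CU_MEM_ALLOC_GRANULARITY_MINIMUM));
    t.reserved = (bytes + t.gran - 1) / t.gran * t.gran;
    // Reserves address space only; no physical GPU memory is consumed yet.
    CU_CHECK(cuMemAddressReserve(&t.base, t.reserved, 0, 0, 0));
}

// Map one more physical page at the end of the tensor (growth on demand).
static void vt_grow(VirtualTensor& t, int dev) {
    CUmemAllocationProp prop = device_prop(dev);
    CUmemGenericAllocationHandle h;
    CU_CHECK(cuMemCreate(&h, t.gran, &prop, 0));
    CU_CHECK(cuMemMap(t.base + t.mapped, t.gran, 0, h, 0));
    CUmemAccessDesc acc = {};
    acc.location = prop.location;
    acc.flags = CU_MEM_ACCESS_FLAGS_PROT_READWRITE;
    CU_CHECK(cuMemSetAccess(t.base + t.mapped, t.gran, &acc, 1));
    t.pages.push_back(h);
    t.mapped += t.gran;
}

// Unmap the last page, returning physical memory to a shared pool.
static void vt_shrink(VirtualTensor& t) {
    assert(t.mapped >= t.gran);
    t.mapped -= t.gran;
    CU_CHECK(cuMemUnmap(t.base + t.mapped, t.gran));
    CU_CHECK(cuMemRelease(t.pages.back()));
    t.pages.pop_back();
}

int main() {
    CU_CHECK(cuInit(0));
    CUdevice dev;  CU_CHECK(cuDeviceGet(&dev, 0));
    CUcontext ctx; CU_CHECK(cuCtxCreate(&ctx, 0, dev));
    VirtualTensor kv;
    vt_reserve(kv, 0, 1ull << 30);   // 1 GiB of address space, 0 bytes mapped
    vt_grow(kv, 0); vt_grow(kv, 0);  // back two pages on demand
    vt_shrink(kv);                   // give one page back elastically
    std::printf("mapped %zu of %zu reserved bytes\n", kv.mapped, kv.reserved);
    return 0;
}
```

Because every tensor draws physical pages from the same pool while keeping a stable virtual address, KV caches and runtime tensors can trade memory without copies or re-allocations, which is the property a unified elastic pool relies on.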

📝 Abstract
Large Language Models are increasingly being deployed in datacenters. Serving these models requires careful memory management, as their memory usage includes static weights, dynamic activations, and key-value caches. While static weights are constant and predictable, dynamic components such as activations and KV caches change frequently during runtime, presenting significant challenges for efficient memory management. Modern LLM serving systems typically handle runtime memory and KV caches at distinct abstraction levels: runtime memory management relies on static tensor abstractions, whereas KV caches utilize a page-table-based virtualization layer built on top of the tensor abstraction. This virtualization dynamically manages KV caches to mitigate memory fragmentation. However, this dual-level approach fundamentally isolates runtime memory and KV cache management, resulting in suboptimal memory utilization under dynamic workloads, which can lead to a nearly 20% drop in throughput. To address these limitations, we propose eLLM, an elastic memory management framework inspired by the classical memory ballooning mechanism in operating systems. The core components of eLLM include: (1) Virtual Tensor Abstraction, which decouples the virtual address space of tensors from the physical GPU memory, creating a unified and flexible memory pool; (2) an Elastic Memory Mechanism that dynamically adjusts memory allocation through runtime memory inflation and deflation, leveraging CPU memory as an extensible buffer; and (3) a Lightweight Scheduling Strategy employing SLO-aware policies to optimize memory utilization and effectively balance performance trade-offs under stringent SLO constraints. Comprehensive evaluations demonstrate that eLLM significantly outperforms state-of-the-art systems, achieving 2.32x higher decoding throughput and supporting 3x larger batch sizes for 128K-token inputs.
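The "CPU memory as an extensible buffer" component can be pictured with a small sketch: under memory pressure, a cold page is copied to pinned host memory so its device page can be unmapped (freeing GPU memory, as balloon inflation does in the OS analogy), and it is copied back before its request resumes. The function names and the SpilledPage record are assumptions for illustration; the paper's actual transfer and bookkeeping logic is not reproduced here.

```cpp
// Hedged sketch: spill a device page to pinned host memory and restore it
// later. Real engines overlap these copies with computation; this version
// synchronizes for simplicity.
#include <cuda_runtime.h>
#include <cassert>

#define RT_CHECK(call) do { cudaError_t e = (call); assert(e == cudaSuccess); } while (0)

struct SpilledPage {        // hypothetical bookkeeping record
    void*  host  = nullptr; // pinned host copy
    size_t bytes = 0;
};

// Copy a device-resident page into the host-side buffer; after this returns,
// the caller may release the device page (e.g., a vt_shrink-style unmap as
// in the previous sketch).
SpilledPage spill_page(const void* dev_page, size_t bytes, cudaStream_t s) {
    SpilledPage p{nullptr, bytes};
    RT_CHECK(cudaMallocHost(&p.host, bytes));   // pinned memory => async DMA
    RT_CHECK(cudaMemcpyAsync(p.host, dev_page, bytes,
                             cudaMemcpyDeviceToHost, s));
    RT_CHECK(cudaStreamSynchronize(s));
    return p;
}

// Copy a spilled page back into freshly mapped device memory.
void restore_page(void* dev_page, SpilledPage& p, cudaStream_t s) {
    RT_CHECK(cudaMemcpyAsync(dev_page, p.host, p.bytes,
                             cudaMemcpyHostToDevice, s));
    RT_CHECK(cudaStreamSynchronize(s));
    RT_CHECK(cudaFreeHost(p.host));
    p.host = nullptr;
}
```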
Problem

Research questions and friction points this paper is trying to address.

Efficient memory management for LLM serving with dynamic components
Unified memory handling for static weights and dynamic KV caches
Optimizing throughput and batch size under SLO constraints
Innovation

Methods, ideas, or system contributions that make the work stand out.

Virtual Tensor Abstraction decouples tensors' virtual address space from physical GPU memory
Elastic Memory Mechanism inflates and deflates allocations at runtime, using CPU memory as a buffer
SLO-aware Lightweight Scheduling Strategy balances memory utilization against latency constraints (see the sketch after this list)
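As a rough illustration of what an SLO-aware policy could look like, the sketch below reclaims GPU pages only while the projected per-token latency still fits the SLO, and hands memory back when pressure eases. The EngineStats fields, watermarks, and latency model are invented for illustration; this is not eLLM's published algorithm.

```cpp
// Hypothetical SLO-aware elasticity policy (illustrative, not eLLM's own):
// reclaim GPU memory only while the latency SLO has headroom to absorb the
// extra CPU-GPU traffic; otherwise hold, or give memory back.
#include <cstddef>

struct EngineStats {            // assumed inputs to the policy
    size_t free_gpu_bytes;      // free physical GPU memory right now
    double token_latency_ms;    // measured per-token decode latency
    double spill_penalty_ms;    // estimated latency cost of one more spill
    double slo_ms;              // per-token latency SLO
};

enum class Action { Reclaim, Restore, Hold };

Action plan_step(const EngineStats& s, size_t low_mark, size_t high_mark) {
    if (s.free_gpu_bytes < low_mark) {
        // Memory pressure: spill a page only if the SLO can absorb it.
        return (s.token_latency_ms + s.spill_penalty_ms <= s.slo_ms)
                   ? Action::Reclaim : Action::Hold;
    }
    if (s.free_gpu_bytes > high_mark)
        return Action::Restore;  // pressure eased: bring spilled pages back
    return Action::Hold;
}
```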
👥 Authors
Jiale Xu (Tencent ARC Lab)
Rui Zhang (Ant Group, China)
Yi Xiong (Ant Group, China)
Cong Guo (Shanghai Jiao Tong University; Shanghai Qi Zhi Institute, China)
Zihan Liu (Shanghai Jiao Tong University; Shanghai Qi Zhi Institute, China)
Yangjie Zhou (National University of Singapore)
Weiming Hu (Shanghai Jiao Tong University; Shanghai Qi Zhi Institute, China)
Hao Wu (Ant Group, China)
Changxu Shao (Ant Group, China)
Ziqing Wang (Ant Group, China)
Yongjie Yuan (Ant Group, China)
Junping Zhao (Ant Group, China)
Minyi Guo (IEEE Fellow, Chair Professor, Shanghai Jiao Tong University)
Jingwen Leng (Professor, Shanghai Jiao Tong University)