vAttention: Dynamic Memory Management for Serving LLMs without PagedAttention

📅 2024-05-07
🏛️ arXiv.org
📈 Citations: 12
Influential: 0
🤖 AI Summary
PagedAttention mitigates KV cache fragmentation by allocating physical memory on demand, but in doing so makes the KV cache non-contiguous in virtual memory, which introduces programming complexity and performance overheads. Method: This paper proposes vAttention, a dynamic memory management scheme that decouples virtual and physical memory allocation. It preserves the virtual address contiguity of the KV cache while mitigating physical memory fragmentation via CUDA virtual memory management APIs, augmented with LLM-specific optimizations. Contribution/Results: The design supports mainstream attention kernels out-of-the-box, including FlashAttention and FlashInfer, ensuring simplicity, portability, and high performance. Experiments show up to 1.23× higher LLM serving throughput compared to PagedAttention, alongside significantly reduced system complexity and improved memory utilization.

📝 Abstract
PagedAttention is a popular approach for dynamic memory allocation in LLM serving systems. It enables on-demand allocation of GPU memory to mitigate KV cache fragmentation -- a phenomenon that crippled the batch size (and consequently throughput) in prior systems. However, in trying to allocate physical memory at runtime, PagedAttention ends up changing the virtual memory layout of the KV cache from contiguous to non-contiguous. Such a design leads to non-trivial programming and performance overheads. We present vAttention -- an approach that mitigates fragmentation in physical memory while retaining the contiguity of KV cache in virtual memory. We achieve this by decoupling the allocation of virtual and physical memory using CUDA virtual memory management APIs. We also introduce various LLM-specific optimizations to address the limitations of CUDA virtual memory support. Overall, vAttention is a simpler, portable, and performant alternative to PagedAttention: it supports various attention kernels out-of-the-box and improves LLM serving throughput by up to 1.23x compared to the use of PagedAttention-based kernels of FlashAttention and FlashInfer.
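The core mechanism described in the abstract — reserving a contiguous virtual address range for each request's KV cache while attaching physical pages only as tokens are generated — can be sketched as follows. This is an illustrative simulation, not the paper's code: the class and method names are hypothetical, and the real system uses CUDA driver APIs (`cuMemAddressReserve`, `cuMemCreate`, `cuMemMap`) in place of the bookkeeping shown here.

```python
# Illustrative sketch of vAttention's decoupling of virtual and physical
# allocation. Names are hypothetical; the actual implementation uses CUDA
# virtual memory management APIs rather than Python bookkeeping.

PAGE_SIZE = 2 * 1024 * 1024  # 2 MiB, a typical CUDA allocation granularity


class VirtualKVCache:
    """A contiguous virtual range with on-demand physical backing."""

    def __init__(self, max_bytes: int):
        # Reserve virtual address space for the worst-case sequence length.
        # This is cheap: no physical memory is committed yet (analogous to
        # cuMemAddressReserve).
        self.reserved_bytes = max_bytes
        self.mapped_pages: list[int] = []  # pages with physical backing

    def ensure_capacity(self, used_bytes: int) -> None:
        # Map physical pages lazily as the KV cache grows (analogous to
        # cuMemCreate + cuMemMap per page). The virtual layout stays
        # contiguous, so attention kernels need no paging logic.
        needed_pages = -(-used_bytes // PAGE_SIZE)  # ceiling division
        while len(self.mapped_pages) < needed_pages:
            self.mapped_pages.append(len(self.mapped_pages))

    @property
    def physical_bytes(self) -> int:
        return len(self.mapped_pages) * PAGE_SIZE


cache = VirtualKVCache(max_bytes=64 * PAGE_SIZE)
cache.ensure_capacity(3 * PAGE_SIZE + 1)  # grows physical backing to 4 pages
print(cache.physical_bytes // PAGE_SIZE)  # -> 4 pages committed
print(cache.reserved_bytes // PAGE_SIZE)  # -> 64 pages reserved virtually
```

Because the virtual range is contiguous, unmodified attention kernels can index the KV cache directly; only the allocator knows that physical pages arrive incrementally, which is what avoids both fragmentation and PagedAttention's non-contiguous layout.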
Problem

Research questions and friction points this paper is trying to address.

Large Language Models
Paged Attention
Memory Management
Innovation

Methods, ideas, or system contributions that make the work stand out.

vAttention
memory management optimization
large language model (LLM) performance enhancement