POD-Attention: Unlocking Full Prefill-Decode Overlap for Faster LLM Inference

📅 2024-10-23
🏛️ arXiv.org
📈 Citations: 3
Influential: 0
🤖 AI Summary
In LLM inference, the prefill phase is compute-bound while the decode phase is memory-bandwidth-bound, so the two phases make mismatched demands on GPU resources and neither fully utilizes the hardware on its own. Existing systems use hybrid batching to combine the phases of different requests into one batch, but their prefill and decode attention kernels remain separate, preventing joint optimization. This work introduces POD-Attention, the first attention kernel that serves hybrid batches in a single launch, fully overlapping prefill and decode execution across the GPU's streaming multiprocessors. By scheduling resources at the CUDA-kernel level so that compute-heavy prefill work and memory-bound decode work run concurrently on the same multiprocessor, POD-Attention speeds up attention computation by up to 59% (28% on average), improving throughput and reducing end-to-end latency.
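The compute-bound versus memory-bandwidth-bound mismatch can be made concrete with a rough arithmetic-intensity estimate for one attention head. The formula and numbers below are an illustrative back-of-the-envelope sketch, not taken from the paper:

```python
def attention_intensity(num_queries, kv_len, head_dim, bytes_per_elem=2):
    """Rough FLOPs-per-byte estimate for one attention head.

    QK^T and PV each cost ~2 * num_queries * kv_len * head_dim FLOPs;
    K and V (2 * kv_len * head_dim elements) must be streamed from memory.
    Illustrative model only: ignores softmax, Q/output traffic, and caching.
    """
    flops = 4 * num_queries * kv_len * head_dim
    bytes_moved = 2 * kv_len * head_dim * bytes_per_elem
    return flops / bytes_moved

# Prefill: many queries attend over the prompt -> high intensity (compute-bound).
prefill = attention_intensity(num_queries=2048, kv_len=2048, head_dim=128)

# Decode: one query per request over its KV cache -> low intensity (memory-bound).
decode = attention_intensity(num_queries=1, kv_len=2048, head_dim=128)
```

Under this toy model, prefill attention performs thousands of FLOPs per byte of KV data moved while decode performs about one, which is why running the two concurrently on the same multiprocessor can keep both compute units and memory bandwidth busy.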

📝 Abstract
Each request in LLM inference goes through two phases: compute-bound prefill and memory-bandwidth-bound decode. To improve GPU utilization, recent systems use hybrid batching that combines the prefill and decode phases of different requests into the same batch. This approach optimizes linear operations but remains inefficient for attention computation because existing attention kernels specialize execution independently for the prefill and decode phases. In this paper, we present POD-Attention - the first GPU kernel that efficiently computes attention for hybrid batches. POD-Attention aims to maximize the utilization of both compute and memory bandwidth by carefully allocating the GPU's resources such that prefill and decode operations happen concurrently on the same multiprocessor. POD-Attention speeds up attention computation by up to 59% (mean 28%), enabling higher throughput and lower latency LLM inference compared to the use of independently optimized prefill and decode attention kernels.
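The abstract notes that hybrid batching already fuses the linear operations of prefill and decode requests, while attention is still computed by separate per-phase kernels. The sketch below illustrates that split; all shapes, names, and weights are invented for this example and do not reflect the paper's implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
hidden = 64

# One prefill request (128 prompt tokens) and 4 decode requests (1 token each).
prefill_tokens = rng.standard_normal((128, hidden))
decode_tokens = rng.standard_normal((4, hidden))

# Linear layers run over the fused hybrid batch: one GEMM for all 132 tokens.
w_qkv = rng.standard_normal((hidden, 3 * hidden))
fused = np.concatenate([prefill_tokens, decode_tokens])
qkv = fused @ w_qkv  # (132, 192)

# Attention, however, is then split back out: existing systems launch
# independent prefill and decode attention kernels on these two slices.
# POD-Attention's contribution is a single kernel that serves both
# concurrently on each streaming multiprocessor.
qkv_prefill, qkv_decode = qkv[:128], qkv[128:]
```

The point of the sketch is the asymmetry: the GEMM is indifferent to which phase a token belongs to, but attention execution diverges (many causal queries per prefill request versus one query over a long KV cache per decode request), which is exactly where the separate kernels leave utilization on the table.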
Problem

Research questions and friction points this paper is trying to address.

Prefill and decode attention kernels are optimized in isolation, leaving hybrid batches inefficient
GPU compute and memory bandwidth are underutilized when the two phases run separately
Attention cost limits LLM inference throughput and latency
Innovation

Methods, ideas, or system contributions that make the work stand out.

POD-Attention GPU kernel
Concurrent prefill and decode
Improved compute and memory utilization