🤖 AI Summary
To address the fundamental tension among heterogeneous KV-cache storage, dynamic request loads, and static compilation constraints in large language model (LLM) inference, this paper proposes an efficient, customizable attention engine for LLM serving. Our method introduces: (1) a novel block-sparse, composable KV-cache format enabling fine-grained memory reuse and cross-request KV sharing; (2) JIT-compiled, customizable attention templates for operator-level flexibility; and (3) a CUDA Graph–compatible, load-aware dynamic scheduling algorithm. Through GPU kernel-level co-optimization, we integrate our engine into mainstream frameworks, including vLLM and SGLang, achieving a 29–69% reduction in inter-token latency, 28–30% lower latency for long-context inference, and 13–17% higher throughput for parallel generation. These results demonstrate substantial improvements in both latency and throughput under realistic, dynamic serving workloads.
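The summary's first contribution, a block-sparse KV-cache format with cross-request sharing, can be illustrated with a minimal sketch. This is not FlashInfer's actual API; the page size, pool layout, and CSR-style `indptr`/`indices` names below are illustrative assumptions, chosen to show how two requests that share a prompt prefix can reference the same physical KV pages instead of duplicating them:

```python
import numpy as np

# Illustrative constants, not FlashInfer defaults.
PAGE_SIZE = 4   # tokens per KV page
NUM_PAGES = 8   # physical pages in the shared pool
HEAD_DIM = 8

# Physical key pool: [num_pages, page_size, head_dim].
k_pool = np.random.rand(NUM_PAGES, PAGE_SIZE, HEAD_DIM).astype(np.float32)

# CSR-style sparse layout: request i owns indices[indptr[i]:indptr[i+1]].
# Requests 0 and 1 both point at pages 0 and 1 (a shared prompt prefix),
# so those pages are stored once and reused across requests.
indptr = np.array([0, 3, 5])
indices = np.array([0, 1, 2,   # request 0: shared pages 0,1 + private page 2
                    0, 1])     # request 1: shared pages 0,1 only

def gather_keys(req_id):
    """Gather one request's keys from the pool into a contiguous [tokens, dim] view."""
    pages = indices[indptr[req_id]:indptr[req_id + 1]]
    return k_pool[pages].reshape(-1, HEAD_DIM)

k0 = gather_keys(0)  # shape (12, HEAD_DIM): 3 pages x 4 tokens
k1 = gather_keys(1)  # shape (8, HEAD_DIM); rows equal k0's first 8 rows
```

In a real engine the gather is fused into the attention kernel rather than materialized, but the indexing idea is the same: block-sparse indices decouple logical sequences from physical storage, which is what enables the fine-grained reuse the summary describes.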
📝 Abstract
Transformers, driven by attention mechanisms, form the foundation of large language models (LLMs). As these models scale up, efficient GPU attention kernels become essential for high-throughput, low-latency inference. Diverse LLM applications demand flexible, high-performance attention solutions. We present FlashInfer: a customizable and efficient attention engine for LLM serving. FlashInfer tackles KV-cache storage heterogeneity with a block-sparse format and composable formats that optimize memory access and reduce redundancy. It also offers a customizable attention template, enabling adaptation to various settings through Just-In-Time (JIT) compilation. Additionally, FlashInfer's load-balanced scheduling algorithm adapts to the dynamism of user requests while remaining compatible with CUDAGraph, which requires static configurations. FlashInfer has been integrated into leading LLM serving frameworks such as SGLang, vLLM, and MLC-Engine. Comprehensive kernel-level and end-to-end evaluations demonstrate that FlashInfer significantly boosts kernel performance across diverse inference scenarios: compared to state-of-the-art LLM serving solutions, FlashInfer achieves a 29–69% inter-token latency reduction over compiler backends for LLM serving, a 28–30% latency reduction for long-context inference, and a 13–17% speedup for LLM serving with parallel generation.
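The abstract's load-balanced scheduling idea can be sketched in a few lines. This is a hedged approximation, not FlashInfer's published algorithm: the chunk size, worker count, and greedy least-loaded assignment below are assumptions. The point it illustrates is that splitting variable-length KV sequences into fixed-size chunks and distributing them over a fixed number of workers keeps the launch configuration static (as CUDAGraph requires) while evening out work under dynamic request loads:

```python
CHUNK = 4        # fixed KV-chunk size (illustrative)
NUM_WORKERS = 3  # fixed worker count -> static launch configuration

def schedule(kv_lens):
    """Assign fixed-size KV chunks of each request to workers, longest first."""
    # Split every request's KV range into chunks of at most CHUNK tokens.
    chunks = []
    for req, n in enumerate(kv_lens):
        for start in range(0, n, CHUNK):
            chunks.append((req, start, min(CHUNK, n - start)))
    # Greedy balancing: give the largest remaining chunk to the least-loaded worker.
    chunks.sort(key=lambda c: -c[2])
    work = [[] for _ in range(NUM_WORKERS)]
    loads = [0] * NUM_WORKERS
    for c in chunks:
        w = loads.index(min(loads))
        work[w].append(c)
        loads[w] += c[2]
    return work, loads

# Three requests with uneven KV lengths (10, 3, 7 tokens).
work, loads = schedule([10, 3, 7])
```

Because the worker count never changes across batches, only the chunk-to-worker mapping (plain index data) varies, so the same captured graph can be replayed for every load pattern; per-worker partial results would then be merged in a reduction pass.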