FlowKV: A Disaggregated Inference Framework with Low-Latency KV Cache Transfer and Load-Aware Scheduling

📅 2025-04-03

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Distributed LLM inference suffers from high KV-cache transmission latency, fragmented memory layouts that cause frequent transmission-kernel invocations, and computational load imbalance arising from rigid prefill/decode (PD) node role assignment. To address these problems, this paper proposes FlowKV, a disaggregated inference framework with low-latency KV-cache transmission and load-aware scheduling. It combines a block-wise contiguous KV memory layout and dynamic PD node role allocation with lightweight transmission kernels, a load-aware scheduling algorithm, and dynamic resource orchestration. In the reported experiments, average KV-cache transmission latency falls from 0.944 s to 0.053 s (a reported 96% reduction), and inference on LongBench is 15.2% to 48.9% faster than the baseline. The framework improves hardware resource utilization, reaches peak system throughput under normal, imbalanced, and extreme-overload conditions, and supports heterogeneous GPUs.

📝 Abstract

Disaggregated inference has become an essential framework that separates the prefill (P) and decode (D) stages in large language model inference to improve throughput. However, the KV cache transfer faces significant delays between prefill and decode nodes. The block-wise calling method and discontinuous KV cache memory allocation increase the number of calls to the transmission kernel. Additionally, existing frameworks often fix the roles of P and D nodes, leading to computational imbalances. In this paper, we propose FlowKV, a novel disaggregated inference framework, which reduces the average transmission latency of KV cache by 96%, from 0.944s to 0.053s, almost eliminating the transfer time relative to the total request latency by optimizing the KV cache transfer. FlowKV introduces the Load-Aware Scheduler for balanced request scheduling and flexible PD node allocation. This design maximizes hardware resource utilization, achieving peak system throughput across various scenarios, including normal, computational imbalance, and extreme overload conditions. Experimental results demonstrate that FlowKV significantly accelerates inference by 15.2%-48.9% on the LongBench dataset compared to the baseline and supports applications with heterogeneous GPUs.
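The abstract's point that discontinuous KV cache allocation increases the number of transmission-kernel calls can be illustrated with a small sketch. This is not FlowKV's implementation: the function name coalesce_blocks and the block IDs are hypothetical, and the sketch assumes one transfer call per contiguous run of physical blocks.

```python
from typing import List, Tuple


def coalesce_blocks(block_ids: List[int]) -> List[Tuple[int, int]]:
    """Merge unique physical block IDs into (start, length) contiguous runs.

    Hypothetical helper: assumes each run can be sent with a single
    transfer call, so len(result) approximates the number of calls.
    """
    runs: List[Tuple[int, int]] = []
    for block in sorted(block_ids):
        if runs and runs[-1][0] + runs[-1][1] == block:
            # Block directly follows the previous run, so extend that run.
            start, length = runs[-1]
            runs[-1] = (start, length + 1)
        else:
            # Gap in the address space, so a new run (and call) is needed.
            runs.append((block, 1))
    return runs


# Illustrative block tables for one request (made-up IDs).
fragmented = [3, 17, 4, 42, 18, 90]   # scattered allocation
contiguous = list(range(8, 14))       # one contiguous segment

print(len(fragmented), "per-block calls")
print(len(coalesce_blocks(fragmented)), "calls after coalescing a fragmented layout")
print(len(coalesce_blocks(contiguous)), "call for a contiguous layout")
```

Under these assumptions, a contiguous layout collapses a request's transfer into a single call, while a fragmented layout still needs one call per gap-separated run.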

Problem

Research questions and friction points this paper is trying to address.

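High KV cache transfer latency between prefill and decode nodes

Block-wise calls and discontinuous KV cache memory allocation that multiply transmission-kernel invocations

Fixed prefill/decode node roles that cause computational load imbalance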

Innovation

Methods, ideas, or system contributions that make the work stand out.

Optimizes KV cache transfer to reduce latency

Introduces Load-Aware Scheduler for balanced scheduling

Supports heterogeneous GPUs for flexible deployment
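The idea of load-aware scheduling with flexible PD node roles can be sketched as follows. This is a minimal illustration, not the paper's Load-Aware Scheduler: the Node fields, the normalized load value, the gap threshold, and the functions pick_node and rebalance are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Node:
    name: str
    role: str    # "prefill" or "decode"
    load: float  # hypothetical normalized load in [0, 1]


def pick_node(nodes: List[Node], role: str) -> Node:
    """Route a request to the least-loaded node that currently has `role`."""
    return min((n for n in nodes if n.role == role), key=lambda n: n.load)


def rebalance(nodes: List[Node], gap: float = 0.5) -> Optional[Node]:
    """Flip one node's role when the mean loads of the two roles diverge.

    Assumed policy: if the mean load of one role exceeds the other by more
    than `gap`, the least-loaded node of the lighter role switches to the
    heavier role. At least one node is always kept in each role.
    """
    def mean_load(role: str) -> float:
        loads = [n.load for n in nodes if n.role == role]
        return sum(loads) / len(loads) if loads else 0.0

    prefill, decode = mean_load("prefill"), mean_load("decode")
    if abs(prefill - decode) <= gap:
        return None
    light, heavy = ("decode", "prefill") if prefill > decode else ("prefill", "decode")
    donors = [n for n in nodes if n.role == light]
    if len(donors) <= 1:
        return None
    donor = min(donors, key=lambda n: n.load)
    donor.role = heavy
    return donor


cluster = [
    Node("a", "prefill", 0.90),
    Node("b", "prefill", 0.95),
    Node("c", "decode", 0.10),
    Node("d", "decode", 0.20),
]
moved = rebalance(cluster)
print(moved.name if moved else "no change")
print(pick_node(cluster, "prefill").name)
```

A real scheduler would also need to account for in-flight requests on the node that changes role and for differences in GPU capability; the sketch ignores both.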

🔎 Similar Papers

Compute Or Load KV Cache? Why Not Both?