SwiftKV: Fast Prefill-Optimized Inference with Knowledge-Preserving Model Transformation

📅 2024-10-04
🏛️ arXiv.org
📈 Citations: 1
✨ Influential: 0
🤖 AI Summary
To address high prefill computational overhead and latency in enterprise-scale LLM inference with long prompts, this paper proposes a knowledge-preserving model transformation and distillation framework. Methodologically, it introduces (1) a novel cross-layer KV cache reuse mechanism that enables prompt tokens to bypass deeper Transformer layers, and (2) a synergistic integration of lightweight knowledge distillation with KV cache compression, accompanied by a restructured layer-mapping strategy. Evaluated on Llama-3.1-70B, the approach reduces prefill computation by 25–50%, doubles end-to-end throughput, cuts per-token latency by 60%, and achieves a peak throughput of 16K tokens/s (560 TFLOPs/GPU). The framework thus simultaneously enhances inference speed, generation quality, and memory efficiency, without compromising model fidelity.

๐Ÿ“ Abstract
LLM inference for enterprise applications, such as summarization, RAG, and code generation, typically observes much longer prompts than generations, leading to high prefill cost and response latency. We present SwiftKV, a novel model transformation and distillation procedure targeted at reducing the prefill compute (in FLOPs) of prompt tokens while preserving high generation quality. First, SwiftKV prefills later layers' KV cache using an earlier layer's output, allowing prompt tokens to skip those later layers. Second, SwiftKV employs a lightweight knowledge-preserving distillation procedure that can adapt existing LLMs with minimal accuracy impact. Third, SwiftKV can naturally incorporate KV cache compression to improve inference performance in low-memory scenarios. Our comprehensive experiments show that SwiftKV can effectively reduce prefill computation by 25-50% across several LLM families while incurring minimal quality degradation. In end-to-end inference serving, SwiftKV realizes up to 2x higher aggregate throughput and 60% lower time per output token. It can achieve a staggering 560 TFlops/GPU of normalized inference throughput, which translates to 16K tokens/s for Llama-3.1-70B. SwiftKV is open-sourced at https://github.com/snowflakedb/arctictraining.
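The core mechanism in the abstract, prefilling later layers' KV cache from an earlier layer's output so prompt tokens can skip those layers, can be illustrated with a minimal NumPy sketch. This is not the paper's implementation: the "layers" here are toy linear maps, and `CUT`, `Wk`, `Wv`, and `Wmlp` are hypothetical stand-ins for a real model's cutoff layer and projection weights.

```python
import numpy as np

rng = np.random.default_rng(0)
D, L, CUT = 8, 4, 2  # hidden size, total layers, hypothetical cutoff layer

# Stand-ins for per-layer K/V projections and the rest of each layer.
Wk = [rng.standard_normal((D, D)) for _ in range(L)]
Wv = [rng.standard_normal((D, D)) for _ in range(L)]
Wmlp = [rng.standard_normal((D, D)) for _ in range(L)]

def layer_forward(h, i):
    # Toy "transformer layer": one nonlinearity stands in for attn + MLP.
    return np.tanh(h @ Wmlp[i])

def swiftkv_prefill(prompt_h):
    """Prefill in the spirit of SwiftKV: run prompt tokens through only
    the first CUT layers, then use that single hidden state to fill the
    KV cache of *every* remaining layer, so layers CUT..L-1 are skipped
    for prompt tokens."""
    h = prompt_h
    kv_cache = {}
    for i in range(CUT):
        kv_cache[i] = (h @ Wk[i], h @ Wv[i])  # normal per-layer KV
        h = layer_forward(h, i)
    for i in range(CUT, L):
        # Cross-layer reuse: layer i's KV comes from layer CUT's output,
        # not from layer i's own (never computed) input.
        kv_cache[i] = (h @ Wk[i], h @ Wv[i])
    return kv_cache

prompt = rng.standard_normal((5, D))  # 5 prompt tokens
cache = swiftkv_prefill(prompt)
print(len(cache))  # KV entries exist for all L layers
```

With `CUT = L/2`, prompt tokens run half the layer forwards, matching the abstract's 25-50% prefill-compute reduction range; generation still attends to a full per-layer KV cache.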
Problem

Research questions and friction points this paper is trying to address.

Reduces prefill compute cost for long-prompt LLM inference
Preserves generation quality via knowledge-preserving distillation
Enables KV cache compression for low-memory scenarios
Innovation

Methods, ideas, or system contributions that make the work stand out.

Prefills later layers' KV cache using earlier layers
Employs lightweight knowledge-preserving distillation
Incorporates KV cache compression for low-memory
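The "knowledge-preserving distillation" listed above trains the transformed model to match the original model's outputs. The paper's exact recipe isn't reproduced here; the sketch below shows only the standard distillation objective (KL divergence between teacher and student output distributions, temperature omitted) that such a procedure minimizes.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

def distill_loss(teacher_logits, student_logits):
    """Mean KL(teacher || student) over positions: the usual
    distillation loss driving the student (transformed model)
    toward the teacher's (original model's) next-token distribution."""
    p = softmax(teacher_logits)
    q = softmax(student_logits)
    return float((p * (np.log(p) - np.log(q))).sum(-1).mean())

t = np.array([[2.0, 0.5, -1.0]])
print(distill_loss(t, t))        # identical outputs -> zero loss
print(distill_loss(t, t * 0.5))  # mismatched outputs -> positive loss
```

Because only a small part of the model changes under the transformation, this objective can be optimized cheaply, which is what makes the adaptation "lightweight".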