HACK: Homomorphic Acceleration via Compression of the Key-Value Cache for Disaggregated LLM Inference

📅 2025-02-05
📈 Citations: 0
Influential: 0
🤖 AI Summary
In disaggregated large language model (LLM) inference, KV cache transmission and dequantization overheads are critical bottlenecks, especially under long-context workloads. To address this, the paper proposes Homomorphic Acceleration via Compression of the KV cache (HACK), a framework that performs attention computation directly in the quantized domain, with no dequantization step. HACK combines quantization-aware approximate matrix multiplication with ultra-low-bit KV compression, decoupling quantization from computation. Designed for disaggregated architectures, it integrates a lightweight compute kernel that operates directly on quantized KV states. Experiments show that HACK reduces Job Completion Time (JCT) by up to 70.9% over a disaggregated inference baseline and by up to 52.3% over state-of-the-art KV quantization methods, demonstrating the value of co-optimizing communication and computation in disaggregated LLM inference.

📝 Abstract
Disaggregated Large Language Model (LLM) inference has gained popularity as it separates the computation-intensive prefill stage from the memory-intensive decode stage, avoiding the prefill-decode interference and improving resource utilization. However, transmitting Key-Value (KV) data between the two stages can be a bottleneck, especially for long prompts. Additionally, the computation time overhead for prefill and decode is key for optimizing Job Completion Time (JCT), and KV data size can become prohibitive for long prompts and sequences. Existing KV quantization methods can alleviate the transmission bottleneck and reduce memory requirements, but they introduce significant dequantization overhead, exacerbating the computation time. We propose Homomorphic Acceleration via Compression of the KV cache (HACK) for disaggregated LLM inference. HACK eliminates the heavy KV dequantization step, and directly performs computations on quantized KV data to approximate and reduce the cost of the expensive matrix-multiplication step. Extensive trace-driven experiments show that HACK reduces JCT by up to 70.9% compared to disaggregated LLM inference baseline and by up to 52.3% compared to state-of-the-art KV quantization methods.
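The core idea the abstract describes, performing the attention matrix multiplication on quantized KV data and folding the quantization scales in afterward instead of dequantizing element by element, can be illustrated with a minimal NumPy sketch. All function names here are hypothetical, and HACK's actual kernel is a more elaborate ultra-low-bit approximation; this only shows the algebraic trick that makes dequantization-free computation possible.

```python
import numpy as np

def quantize_per_token(K, bits=8):
    # Symmetric per-token quantization: K ≈ scale * K_q, with one
    # scale per KV row. (Hypothetical helper, not HACK's exact scheme.)
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(K).max(axis=1, keepdims=True) / qmax  # shape (n_kv, 1)
    K_q = np.round(K / scale).astype(np.int8)
    return K_q, scale

def attention_logits_quantized(Q, K_q, scale):
    # Since K ≈ scale * K_q, we have Q @ K.T ≈ (Q @ K_q.T) * scale.T:
    # the matmul runs on the quantized matrix, and the per-token scales
    # are applied once per output column, never per element of K.
    return (Q @ K_q.T.astype(Q.dtype)) * scale.T

# Usage: the quantized-domain logits closely track the exact ones.
rng = np.random.default_rng(0)
Q = rng.standard_normal((4, 64))
K = rng.standard_normal((16, 64))
K_q, scale = quantize_per_token(K)
approx = attention_logits_quantized(Q, K_q, scale)
exact = Q @ K.T
```

The design point is that the expensive inner product touches only the compressed representation; the scale correction is a cheap elementwise multiply on the much smaller logit matrix, which is why skipping dequantization reduces computation rather than just memory traffic.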
Problem

Research questions and friction points this paper is trying to address.

KV cache transmission between the prefill and decode stages bottlenecks disaggregated inference, especially for long prompts.
Prefill and decode computation time drives Job Completion Time (JCT), and KV data size grows prohibitively with sequence length.
Existing KV quantization methods ease transmission and memory pressure but introduce significant dequantization overhead.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Homomorphic Acceleration via Compression of the KV cache (HACK)
Approximate matrix multiplication performed directly on quantized KV data, eliminating dequantization
Reduces JCT by up to 70.9% over the disaggregated inference baseline