🤖 AI Summary
Large language models face a deployment bottleneck in long-context reasoning: the memory overhead of storing the key-value (KV) cache. Existing sequence-level compression methods discard entire tokens coarsely, risking the loss of critical contextual information. This paper proposes UniGist, a fine-grained sequence-level compression framework that replaces raw tokens with special compression tokens ("gists") to preserve fine-grained details and long-range dependencies while significantly reducing KV cache size. UniGist introduces a chunk-free training paradigm with a gist shift trick, implemented in an efficient GPU kernel, and supports the actual removal of compressed tokens during inference for real-time memory release. Experiments across multiple long-context benchmarks, especially those requiring detailed recall and long-range dependency modeling, demonstrate that UniGist substantially outperforms baselines, achieving up to 2.5× KV memory reduction without compromising model accuracy.
📝 Abstract
Large language models are increasingly capable of handling long-context inputs, but the memory overhead of the key-value (KV) cache remains a major bottleneck for general-purpose deployment. While various compression strategies have been explored, sequence-level compression, which drops the full KV cache of certain tokens, is particularly challenging because it can lead to the loss of important contextual information. To address this, we introduce UniGist, a sequence-level long-context compression framework that efficiently preserves context information by replacing raw tokens with special compression tokens (gists) in a fine-grained manner. We adopt a chunk-free training strategy and design an efficient kernel with a gist shift trick, enabling optimized GPU training. Our scheme also supports flexible inference by allowing the actual removal of compressed tokens, resulting in real-time memory savings. Experiments across multiple long-context tasks demonstrate that UniGist significantly improves compression quality, with especially strong performance in detail-recall tasks and long-range dependency modeling.
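The core idea of replacing raw tokens with gists and then releasing their KV entries can be sketched as follows. This is an illustrative toy, not the paper's implementation: the names (`KVEntry`, `compress_with_gists`, `chunk_size`) and the fixed-interval placement of gists are assumptions made for clarity; in UniGist the gists are learnable tokens whose KV states summarize the context they replace.

```python
from dataclasses import dataclass

@dataclass
class KVEntry:
    token: str      # token this cache entry belongs to
    is_gist: bool   # gist entries survive eviction; raw entries do not

def compress_with_gists(tokens, chunk_size=4):
    """Append a gist entry after every `chunk_size` raw tokens, then evict
    the summarized raw entries, keeping only gists plus any trailing
    partial chunk that has not yet been summarized."""
    cache = []    # surviving KV entries (gists so far)
    pending = []  # raw entries of the current, not-yet-summarized chunk
    for i, tok in enumerate(tokens):
        pending.append(KVEntry(tok, is_gist=False))
        if len(pending) == chunk_size:
            # A gist token summarizes the chunk it covers ...
            cache.append(KVEntry(f"<gist:{i // chunk_size}>", is_gist=True))
            # ... so the chunk's raw KV entries can be released.
            pending = []
    return cache + pending
```

For a 10-token input with `chunk_size=4`, the cache shrinks from 10 entries to 4 (two gists plus the two trailing raw tokens), which is where the sequence-level memory savings come from; the paper's contribution is making this replacement trainable and efficient on GPUs rather than the bookkeeping shown here.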