🤖 AI Summary
Large language models face a deployment bottleneck in long-context reasoning: the memory overhead of storing the key-value (KV) cache. Existing sequence-level compression methods discard entire tokens coarsely, risking the loss of critical contextual information. This paper proposes UniGist, a fine-grained sequence-level compression framework that replaces raw tokens with special compression tokens ("gists") to preserve fine-grained details and long-range dependencies while significantly reducing KV cache size. UniGist introduces a chunk-free training paradigm with a gist shift trick, implemented in an efficient GPU kernel, and supports the actual removal of compressed tokens during inference for real-time memory release. Experiments across multiple long-context benchmarks, especially those requiring detailed recall and long-range dependency modeling, demonstrate that UniGist substantially outperforms baselines, achieving up to 2.5× KV memory reduction without compromising model accuracy.
📝 Abstract
Large language models are increasingly capable of handling long-context inputs, but the memory overhead of the key-value (KV) cache remains a major bottleneck for general-purpose deployment. While various compression strategies have been explored, sequence-level compression, which drops the full KV cache of certain tokens, is particularly challenging because it can lead to the loss of important contextual information. To address this, we introduce UniGist, a sequence-level long-context compression framework that efficiently preserves context information by replacing raw tokens with special compression tokens (gists) in a fine-grained manner. We adopt a chunk-free training strategy and design an efficient kernel with a gist shift trick, enabling optimized GPU training. Our scheme also supports flexible inference by allowing the actual removal of compressed tokens, resulting in real-time memory savings. Experiments across multiple long-context tasks demonstrate that UniGist significantly improves compression quality, with especially strong performance in detail-recall tasks and long-range dependency modeling.
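The core idea of replacing raw tokens with gists and then releasing their KV entries can be sketched as follows. This is an illustrative toy, not the paper's implementation: the names (`KVEntry`, `compress_with_gists`, `chunk_size`) and the fixed-interval placement of gists are assumptions made for clarity; in UniGist the gists are learnable tokens whose KV states summarize the context they replace.

```python
from dataclasses import dataclass

@dataclass
class KVEntry:
    token: str      # token this cache entry belongs to
    is_gist: bool   # gist entries survive eviction; raw entries do not

def compress_with_gists(tokens, chunk_size=4):
    """Append a gist entry after every `chunk_size` raw tokens, then evict
    the summarized raw entries, keeping only gists plus any trailing
    partial chunk that has not yet been summarized."""
    cache = []    # surviving KV entries (gists so far)
    pending = []  # raw entries of the current, not-yet-summarized chunk
    for i, tok in enumerate(tokens):
        pending.append(KVEntry(tok, is_gist=False))
        if len(pending) == chunk_size:
            # A gist token summarizes the chunk it covers ...
            cache.append(KVEntry(f"<gist:{i // chunk_size}>", is_gist=True))
            # ... so the chunk's raw KV entries can be released.
            pending = []
    return cache + pending
```

For a 10-token input with `chunk_size=4`, the cache shrinks from 10 entries to 4 (two gists plus the two trailing raw tokens), which is where the sequence-level memory savings come from; the paper's contribution is making this replacement trainable and efficient on GPUs rather than the bookkeeping shown here.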