🤖 AI Summary
To address the memory and computational overhead of long-context processing in large language models (LLMs), this paper proposes a sentence-anchored gist compression mechanism. It employs learnable compression tokens to condense the context semantically, using sentence-level anchoring for precise alignment, which enables efficient and controllable context reduction. The method significantly outperforms unsupervised compression baselines at 2×–8× compression ratios, maintaining stable performance on both short- and long-context benchmarks. Experiments on a 3B-parameter LLaMA model show that higher compression ratios incur no substantial performance degradation, striking an effective balance between compression efficiency and task fidelity. The approach offers a lightweight, scalable path to long-context inference, improving resource efficiency without sacrificing semantic integrity or downstream accuracy.
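The core idea above can be illustrated with a minimal layout sketch. This is an assumption-laden toy, not the paper's implementation: `GIST` stands in for a learnable compression token, one (or more) gist tokens are appended after each sentence (the "sentence anchor"), and after compression only the gist positions are retained in the KV cache, so the achieved compression ratio is roughly the average sentence length divided by the gists per sentence.

```python
GIST = "<gist>"  # hypothetical placeholder for a learnable compression token

def compress_layout(sentences, gists_per_sentence=1):
    """Build the compressed-context layout.

    sentences: list of token lists, one per sentence.
    Returns (full_sequence, kept_positions, compression_ratio), where
    kept_positions are the gist-token indices retained after compression.
    """
    seq, kept = [], []
    for sent in sentences:
        seq.extend(sent)                      # original sentence tokens
        for _ in range(gists_per_sentence):   # sentence-anchored gist(s)
            kept.append(len(seq))
            seq.append(GIST)
    ratio = len(seq) / len(kept)              # tokens in vs. tokens kept
    return seq, kept, ratio

# Example: two sentences, one gist each
seq, kept, ratio = compress_layout([["the", "cat", "sat"], ["dogs", "bark"]])
# kept == [3, 6]: only the two gist positions survive compression
```

Varying `gists_per_sentence` is one way to trade compression ratio against fidelity, mirroring the 2×–8× range reported in the paper.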
📝 Abstract
This work investigates context compression for Large Language Models (LLMs) using learned compression tokens to reduce the memory and computational demands of processing long sequences. We demonstrate that pre-trained LLMs can be fine-tuned to compress their context by factors of 2× to 8× without significant performance degradation, as evaluated on both short-context and long-context benchmarks. Furthermore, in experiments on a 3-billion-parameter LLaMA model, our method achieves results on par with alternative compression techniques while attaining higher compression ratios.