AI Summary
Existing soft compression methods for long-context processing in large language models (LLMs) suffer from quadratic self-attention complexity and poor reusability and scalability. To address this, we propose a segmented soft compression approach: the input context is partitioned into non-overlapping segments, each independently mapped to a low-dimensional latent representation; the key-value (KV) cache is then optimized to enable cross-query sharing. This breaks the conventional holistic compression paradigm, reducing compression complexity from O(n²) to O(n), enabling linear scalability and support for ultra-long contexts. Experiments demonstrate that, in long-context regimes, our method reduces first-token latency by up to 4× and cuts the KV cache memory footprint by 50%, while matching or even surpassing the question-answering accuracy of the original uncompressed input. These gains significantly enhance deployment efficiency and system scalability.
Abstract
Large Language Models (LLMs) face significant computational challenges when processing long contexts due to the quadratic complexity of self-attention. While soft context compression methods, which map input text to smaller latent representations, have shown promise, their real-world adoption is limited. Existing techniques typically compress the context as a single unit, which leads to quadratic compression complexity and an inability to reuse computations across queries with overlapping contexts. In this work, we introduce CompLLM, a soft compression technique designed for practical deployment. Instead of processing the context holistically, CompLLM divides it into segments and compresses each one independently. This simple design choice yields three critical properties: efficiency, as the compression step scales linearly with the context length; scalability, enabling models trained on short sequences (e.g., 1k tokens) to generalize to contexts of 100k tokens; and reusability, allowing compressed segments to be cached and reused across different queries. Our experiments show that with a 2x compression rate, at high context lengths CompLLM speeds up Time To First Token (TTFT) by up to 4x and reduces the KV cache size by 50%. Furthermore, CompLLM achieves performance comparable to that obtained with the uncompressed context, and even surpasses it on very long sequences, demonstrating its effectiveness and practical utility.
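The segment-wise design described above can be sketched with a toy example. Everything here is an illustrative stand-in rather than the paper's method: segment length, compression rate, and the pair-averaging "compressor" are hypothetical choices, whereas CompLLM uses a learned module producing latent concept embeddings. The sketch shows the two properties that follow from per-segment independence: compression cost grows linearly with context length, and identical segments are reused across queries via a cache.

```python
# Toy sketch of segment-wise context compression (NOT the paper's
# learned compressor): split tokens into fixed-size segments, compress
# each independently, and cache results keyed by segment content so
# overlapping contexts across queries reuse prior work.

SEGMENT_LEN = 8   # tokens per segment (illustrative value)
COMPRESSION = 2   # 2x compression rate, as in the paper's experiments

def compress_segment(segment: tuple) -> tuple:
    """Stand-in for the learned compressor: maps a segment to half as
    many values by averaging adjacent pairs (a 2x reduction)."""
    return tuple((segment[i] + segment[i + 1]) / 2
                 for i in range(0, len(segment) - 1, COMPRESSION))

_cache = {}  # compressed segments, shared across queries

def compress_context(tokens):
    """Compress a context segment by segment.

    Cost is one compress_segment call per segment (linear in context
    length), and previously seen segments are fetched from the cache
    instead of recomputed. Returns (compressed sequence, cache hits).
    """
    out, hits = [], 0
    for start in range(0, len(tokens), SEGMENT_LEN):
        seg = tuple(tokens[start:start + SEGMENT_LEN])
        if seg in _cache:
            hits += 1
        else:
            _cache[seg] = compress_segment(seg)
        out.extend(_cache[seg])
    return out, hits

# Two queries sharing a 16-token prefix: the second call reuses the
# prefix's two compressed segments and only compresses the new one.
prefix = [float(t) for t in range(16)]
c1, hits1 = compress_context(prefix)                               # 0 hits
c2, hits2 = compress_context(prefix + [float(t) for t in range(16, 24)])  # 2 hits
```

A holistic compressor would have to reprocess the entire concatenated context for the second query; here only the unseen segment incurs new work, which is what makes cached segments reusable across different queries.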