Compress, Gather, and Recompute: REFORMing Long-Context Processing in Transformers

📅 2025-06-01

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

To address information loss and high memory consumption when Transformers run inference over contexts far beyond their pre-trained limits (evaluated at 1M tokens), this paper proposes REFORM, a two-phase inference framework summarized as “Compress, Gather, and Recompute.” In the first phase, it processes the input incrementally in chunks while maintaining a compressed KV cache, constructs cross-layer context embeddings, and uses early exit to reduce computation. In the second phase, it identifies and gathers essential tokens via similarity matching and selectively recomputes the KV cache. At 1M context length, REFORM achieves over 50% and 27% performance gains over baselines on RULER and BABILong, respectively; it also outperforms baselines on Infinite-Bench and MM-NIAH. In addition, it reduces inference time by 30% and peak memory usage by 5%.

📝 Abstract

As large language models increasingly gain popularity in real-world applications, processing extremely long contexts, often exceeding the model's pre-trained context limits, has emerged as a critical challenge. While existing approaches to efficient long-context processing show promise, recurrent compression-based methods struggle with information preservation, whereas random-access approaches require substantial memory resources. We introduce REFORM, a novel inference framework that efficiently handles long contexts through a two-phase approach. First, it incrementally processes input chunks while maintaining a compressed KV cache, constructs cross-layer context embeddings, and uses an early-exit strategy for improved efficiency. Second, it identifies and gathers essential tokens via similarity matching and selectively recomputes the KV cache. Compared to baselines, REFORM achieves over 50% and 27% performance gains on RULER and BABILong, respectively, at 1M context length. It also outperforms baselines on Infinite-Bench and MM-NIAH, demonstrating flexibility across diverse tasks and domains. Additionally, REFORM reduces inference time by 30% and peak memory usage by 5%, achieving both efficiency and superior performance.
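
The gather step of the second phase can be pictured with a small sketch. This is an illustration under stated assumptions, not the paper's implementation: the function name `gather_salient_tokens`, the use of cosine similarity, and the fixed top-k budget are all hypothetical choices, and the toy 2-dimensional vectors stand in for the cross-layer context embeddings.

```python
import numpy as np

def gather_salient_tokens(token_emb, query_emb, k):
    """Return indices of the k tokens most similar to the query, in original order.

    token_emb: (n, d) array, one embedding per context token (assumed given).
    query_emb: (d,) array, embedding of the query (assumed given).
    """
    eps = 1e-12  # guards against division by zero for all-zero vectors
    t = token_emb / (np.linalg.norm(token_emb, axis=1, keepdims=True) + eps)
    q = query_emb / (np.linalg.norm(query_emb) + eps)
    scores = t @ q  # cosine similarity of every token to the query
    k = min(k, scores.shape[0])
    top = np.argsort(-scores, kind="stable")[:k]  # k highest-scoring tokens
    return np.sort(top)  # restore document order before recomputation

# Toy example: five tokens with 2-dimensional embeddings.
emb = np.array([[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [-1.0, 0.0], [0.5, 0.5]])
query = np.array([1.0, 0.0])
print(gather_salient_tokens(emb, query, k=2).tolist())  # [0, 2]
```

Sorting the selected indices keeps the gathered tokens in their original order, which is what a subsequent KV recomputation over the gathered subsequence would need.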

Problem

Research questions and friction points this paper is trying to address.

Efficiently processing extremely long contexts in Transformers

Preserving information while compressing the KV cache

Reducing memory usage and inference time

Innovation

Methods, ideas, or system contributions that make the work stand out.

Incremental processing with compressed KV cache

Construction of cross-layer context embeddings

Selective recomputation via similarity matching
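
The first idea, incremental processing under a bounded cache, can be sketched as follows. This is a minimal sketch with loud assumptions: the per-token importance `scores` are taken as given, and the eviction rule (keep the highest-scoring entries) is a placeholder, not the compression method used by REFORM. The names `process_in_chunks`, `chunk_size`, and `budget` are hypothetical.

```python
import numpy as np

def process_in_chunks(keys, scores, chunk_size, budget):
    """Stream over tokens chunk by chunk, retaining at most `budget` cache entries.

    keys:   (n, d) array standing in for per-token KV entries.
    scores: (n,) array of per-token importance (assumed given).
    """
    kept_idx = np.empty(0, dtype=int)
    for start in range(0, len(keys), chunk_size):
        stop = min(start + chunk_size, len(keys))
        # Candidates are the retained entries plus the newly processed chunk.
        cand = np.concatenate([kept_idx, np.arange(start, stop)])
        if len(cand) > budget:
            # Placeholder eviction rule: keep the `budget` highest-scoring entries.
            order = np.argsort(-scores[cand], kind="stable")[:budget]
            cand = np.sort(cand[order])
        kept_idx = cand
    return kept_idx, keys[kept_idx]

# Toy example: eight tokens, chunks of three, cache budget of four.
keys = np.arange(16, dtype=float).reshape(8, 2)
scores = np.array([0.1, 0.9, 0.2, 0.8, 0.3, 0.7, 0.05, 0.6])
kept, cache = process_in_chunks(keys, scores, chunk_size=3, budget=4)
print(kept.tolist())  # [1, 3, 5, 7]
```

In this sketch the cache holds at most `budget + chunk_size` entries at any moment and `budget` entries after each eviction, so memory stays bounded regardless of the total context length.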

🔎 Similar Papers

Position IDs Matter: An Enhanced Position Layout for Efficient Context Compression in Large Language Models