ILRe: Intermediate Layer Retrieval for Context Compression in Causal Language Models

📅 2025-08-25
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Large language models (LLMs) face fundamental scalability challenges with ultra-long contexts (e.g., 1M tokens), owing to quadratic computational complexity, excessive memory consumption, and limited effective context length. To address this, we propose Intermediate Layer Retrieval (ILRe): an intermediate decoder layer, selected offline, is used for streaming chunked prefilling, combined with attention-score-driven key-token recall. We further introduce a novel multi-pooling kernel allocation strategy that preserves semantic completeness during recall; the overall pipeline reduces prefilling complexity from $O(L^2)$ to $O(L)$ without post-training or custom operators. Evaluated with the Llama-3.1-UltraLong-8B-1M-Instruct model on a Huawei Ascend 910B NPU, ILRe processes a single 1M-token request in under 30 seconds (≈180× inference speedup) while attaining a RULER-1M score of ≈79.8, matching full-context performance.

📝 Abstract
Large Language Models (LLMs) have demonstrated success across many benchmarks. However, they still exhibit limitations in long-context scenarios, primarily due to their short effective context length, quadratic computational complexity, and high memory overhead when processing lengthy inputs. To mitigate these issues, we introduce a novel context compression pipeline, called Intermediate Layer Retrieval (ILRe), which determines one intermediate decoder layer offline, encodes the context by streaming chunked prefill only up to that layer, and recalls tokens by the attention scores between the input query and the full key cache in that layer. In particular, we propose a multi-pooling kernel allocation strategy in the token recall process to maintain semantic completeness. Our approach not only reduces the prefilling complexity from $O(L^2)$ to $O(L)$, but also achieves performance comparable to or better than the full context in long-context scenarios. Without additional post-training or operator development, ILRe can process a single $1M$-token request in less than half a minute (speedup $\approx 180\times$) and scores $\approx 79.8$ on the RULER-$1M$ benchmark with the model Llama-3.1-UltraLong-8B-1M-Instruct on a Huawei Ascend 910B NPU.
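The recall step described in the abstract can be sketched as follows. This is a hypothetical illustration, not the paper's implementation: it assumes the full key cache of the offline-chosen intermediate layer is available, the function and parameter names (`recall_tokens`, `pool_sizes`, `budget`) are invented for this sketch, and simple average pooling stands in for the paper's pooling kernels.

```python
import numpy as np

def recall_tokens(query_states, key_cache, budget, pool_sizes=(1, 3, 5)):
    """Rank context positions by pooled query-key attention scores in one
    intermediate layer and keep the top `budget` token indices.
    (Illustrative sketch; names and pooling choice are assumptions.)"""
    scores = query_states @ key_cache.T        # (q_len, ctx_len) raw logits
    scores = scores.max(axis=0)                # best score per context token
    # Multi-pooling: smooth the score curve with several kernel widths so
    # that neighbours of a high-scoring token are recalled as well,
    # preserving local semantic completeness.
    pooled = np.zeros_like(scores)
    for k in pool_sizes:
        pooled += np.convolve(scores, np.ones(k) / k, mode="same")
    keep = np.argsort(pooled)[-budget:]        # highest pooled scores
    return np.sort(keep)                       # restore original token order
```

The recalled indices would then select the compressed key/value entries fed to the remaining decoder layers in place of the full context.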
Problem

Research questions and friction points this paper is trying to address.

Long contexts exceed LLM memory limits and effective context length
Quadratic prefilling complexity makes long inputs prohibitively expensive
Accelerating long-context processing risks losing semantic integrity
Innovation

Methods, ideas, or system contributions that make the work stand out.

Intermediate Layer Retrieval pipeline for context compression
Multi-pooling kernels strategy for semantic completeness
Linear prefilling complexity with comparable performance
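The claimed drop from $O(L^2)$ to $O(L)$ prefilling can be illustrated with a toy FLOP count for the attention-score computation alone. This is a simplification for intuition, not the paper's cost model: the function name, the fixed `chunk_size`, and the assumption that each chunk's attention cost is bounded by the chunk size are all illustrative.

```python
def attn_score_flops(ctx_len, chunk_size=None, head_dim=128):
    """Toy FLOP count for computing attention scores during prefill.
    With full attention every token attends to every token: O(L^2).
    With streaming chunked prefill (assumed here: attention bounded by a
    fixed chunk window), cost per chunk is constant, so total is O(L)."""
    if chunk_size is None:
        return ctx_len * ctx_len * head_dim      # full prefill, O(L^2)
    n_chunks = -(-ctx_len // chunk_size)         # ceil division
    return n_chunks * chunk_size * chunk_size * head_dim  # O(L)
```

Under these assumptions, doubling the context doubles the chunked cost but quadruples the full-attention cost, so the speedup over full prefill grows linearly with context length.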
Manlai Liang
AI Lab, China Merchants Bank, China
Mandi Liu
Carnegie Mellon University
Jiangzhou Ji
AI Lab, China Merchants Bank, China
Huaijun Li
AI Lab, China Merchants Bank, China
Haobo Yang
AI Lab, China Merchants Bank, China
Yaohan He
AI Lab, China Merchants Bank, China
Jinlong Li
AI Lab, China Merchants Bank, China