ILRe: Intermediate Layer Retrieval for Context Compression in Causal Language Models

📅 2025-08-25
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Large language models (LLMs) face fundamental scalability challenges with ultra-long contexts (e.g., 1M tokens), owing to quadratic computational complexity, excessive memory consumption, and limited effective context length. To address this, we propose Intermediate Layer Retrieval (ILRe): an intermediate decoder layer, selected offline, is used for streaming chunked prefilling, combined with attention-score-driven key-token recall. We further introduce a novel multi-pooling kernel allocation strategy that preserves semantic completeness during recall; the overall pipeline reduces prefilling complexity from $O(L^2)$ to $O(L)$ without post-training or custom operators. Evaluated with the Llama-3.1-UltraLong-8B-1M-Instruct model on a Huawei Ascend 910B NPU, ILRe processes a single 1M-token request in under 30 seconds (≈180× inference speedup) while attaining a RULER-1M score of ≈79.8, matching full-context performance.

📝 Abstract
Large Language Models (LLMs) have demonstrated success across many benchmarks. However, they still exhibit limitations in long-context scenarios, primarily due to their short effective context length, quadratic computational complexity, and high memory overhead when processing lengthy inputs. To mitigate these issues, we introduce a novel context compression pipeline, called Intermediate Layer Retrieval (ILRe), which determines one intermediate decoder layer offline, encodes the context by streaming chunked prefill only up to that layer, and recalls tokens by the attention scores between the input query and the full key cache in that layer. In particular, we propose a multi-pooling kernel allocation strategy in the token recall process to maintain semantic completeness. Our approach not only reduces the prefilling complexity from $O(L^2)$ to $O(L)$, but also achieves performance comparable to or better than the full context in long-context scenarios. Without additional post-training or operator development, ILRe can process a single $1M$-token request in less than half a minute (speedup $\approx 180\times$) and scores $\approx 79.8$ on the RULER-$1M$ benchmark with the model Llama-3.1-UltraLong-8B-1M-Instruct on a Huawei Ascend 910B NPU.
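The recall step described in the abstract can be sketched as follows. This is a hypothetical illustration, not the paper's implementation: it assumes the full key cache of the offline-chosen intermediate layer is available, the function and parameter names (`recall_tokens`, `pool_sizes`, `budget`) are invented for this sketch, and simple average pooling stands in for the paper's pooling kernels.

```python
import numpy as np

def recall_tokens(query_states, key_cache, budget, pool_sizes=(1, 3, 5)):
    """Rank context positions by pooled query-key attention scores in one
    intermediate layer and keep the top `budget` token indices.
    (Illustrative sketch; names and pooling choice are assumptions.)"""
    scores = query_states @ key_cache.T        # (q_len, ctx_len) raw logits
    scores = scores.max(axis=0)                # best score per context token
    # Multi-pooling: smooth the score curve with several kernel widths so
    # that neighbours of a high-scoring token are recalled as well,
    # preserving local semantic completeness.
    pooled = np.zeros_like(scores)
    for k in pool_sizes:
        pooled += np.convolve(scores, np.ones(k) / k, mode="same")
    keep = np.argsort(pooled)[-budget:]        # highest pooled scores
    return np.sort(keep)                       # restore original token order
```

The recalled indices would then select the compressed key/value entries fed to the remaining decoder layers in place of the full context.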
Problem

Research questions and friction points this paper is trying to address.

Long contexts exceed LLM memory limits and effective context length
Quadratic prefilling complexity makes long inputs prohibitively expensive
Accelerating long-context processing risks losing semantic integrity
Innovation

Methods, ideas, or system contributions that make the work stand out.

Intermediate Layer Retrieval pipeline for context compression
Multi-pooling kernels strategy for semantic completeness
Linear prefilling complexity with comparable performance
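The claimed drop from $O(L^2)$ to $O(L)$ prefilling can be illustrated with a toy FLOP count for the attention-score computation alone. This is a simplification for intuition, not the paper's cost model: the function name, the fixed `chunk_size`, and the assumption that each chunk's attention cost is bounded by the chunk size are all illustrative.

```python
def attn_score_flops(ctx_len, chunk_size=None, head_dim=128):
    """Toy FLOP count for computing attention scores during prefill.
    With full attention every token attends to every token: O(L^2).
    With streaming chunked prefill (assumed here: attention bounded by a
    fixed chunk window), cost per chunk is constant, so total is O(L)."""
    if chunk_size is None:
        return ctx_len * ctx_len * head_dim      # full prefill, O(L^2)
    n_chunks = -(-ctx_len // chunk_size)         # ceil division
    return n_chunks * chunk_size * chunk_size * head_dim  # O(L)
```

Under these assumptions, doubling the context doubles the chunked cost but quadruples the full-attention cost, so the speedup over full prefill grows linearly with context length.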
Manlai Liang
AI Lab, China Merchants Bank, China
Mandi Liu
Carnegie Mellon University
Jiangzhou Ji
AI Lab, China Merchants Bank, China
Huaijun Li
AI Lab, China Merchants Bank, China
Haobo Yang
AI Lab, China Merchants Bank, China
Yaohan He
AI Lab, China Merchants Bank, China
Jinlong Li
AI Lab, China Merchants Bank, China