🤖 AI Summary
To address the high inference cost, latency, and “lost-in-the-middle” phenomenon induced by long contexts in large language models (LLMs), this work—grounded in information bottleneck theory—formulates context compression as maximizing query-conditioned mutual information, marking the first principled departure from conventional redundant-token removal paradigms. We propose a cross-attention-based framework for tractable mutual information approximation, enabling modular architecture design and component substitution. Additionally, we introduce importance-aware reweighting and dynamic truncation strategies to preserve task-critical information. Evaluated on four benchmark datasets, our method achieves a 25% higher compression ratio than state-of-the-art approaches while maintaining or even improving question-answering accuracy; in several scenarios, performance with compressed context surpasses that of the full original context.
📝 Abstract
Generative LLM have achieved remarkable success in various industrial applications, owing to their promising In-Context Learning capabilities. However, the issue of long context in complex tasks poses a significant barrier to their wider adoption, manifested in two main aspects: (i) The excessively long context leads to high costs and inference delays. (ii) A substantial amount of task-irrelevant information introduced by long contexts exacerbates the"lost in the middle"problem. Existing methods compress context by removing redundant tokens using metrics such as self-information or PPL, which is inconsistent with the objective of retaining the most important tokens when conditioning on a given query. In this study, we introduce information bottleneck theory (IB) to model the problem, offering a novel perspective that thoroughly addresses the essential properties required for context compression. Additionally, we propose a cross-attention-based approach to approximate mutual information in IB, which can be flexibly replaced with suitable alternatives in different scenarios. Extensive experiments on four datasets demonstrate that our method achieves a 25% increase in compression rate compared to the state-of-the-art, while maintaining question answering performance. In particular, the context compressed by our method even outperform the full context in some cases.