QUITO-X: A New Perspective on Context Compression from the Information Bottleneck Theory

📅 2024-08-20
📈 Citations: 2
Influential: 0
🤖 AI Summary
To address the high inference cost, latency, and “lost-in-the-middle” phenomenon induced by long contexts in large language models (LLMs), this work—grounded in information bottleneck theory—formulates context compression as maximizing query-conditioned mutual information, marking the first principled departure from conventional redundant-token removal paradigms. We propose a cross-attention-based framework for tractable mutual information approximation, enabling modular architecture design and component substitution. Additionally, we introduce importance-aware reweighting and dynamic truncation strategies to preserve task-critical information. Evaluated on four benchmark datasets, our method achieves a 25% higher compression ratio than state-of-the-art approaches while maintaining or even improving question-answering accuracy; in several scenarios, performance with compressed context surpasses that of the full original context.
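The query-conditioned information-bottleneck objective described above can be sketched as follows. This is the generic IB form adapted to the setting the summary describes; the paper's exact notation and conditioning may differ:

```latex
% Context X, compressed context C, query Q, target answer Y.
% Compress X into C while keeping what is needed to answer Q:
\min_{p(C \mid X,\, Q)} \; I(X; C) \;-\; \beta \, I(C; Y \mid Q)
```

The first term penalizes retained information (compression pressure); the second rewards query-conditioned predictive information; the multiplier β trades the two off.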

📝 Abstract
Generative LLMs have achieved remarkable success in various industrial applications, owing to their promising In-Context Learning capabilities. However, the issue of long context in complex tasks poses a significant barrier to their wider adoption, manifested in two main aspects: (i) excessively long contexts lead to high costs and inference delays; (ii) the substantial amount of task-irrelevant information introduced by long contexts exacerbates the "lost-in-the-middle" problem. Existing methods compress context by removing redundant tokens using metrics such as self-information or PPL, which is inconsistent with the objective of retaining the most important tokens when conditioning on a given query. In this study, we introduce information bottleneck theory (IB) to model the problem, offering a novel perspective that thoroughly addresses the essential properties required for context compression. Additionally, we propose a cross-attention-based approach to approximate mutual information in IB, which can be flexibly replaced with suitable alternatives in different scenarios. Extensive experiments on four datasets demonstrate that our method achieves a 25% increase in compression rate compared to the state of the art while maintaining question-answering performance. In particular, the context compressed by our method even outperforms the full context in some cases.
Problem

Research questions and friction points this paper is trying to address.

Compressing long contexts to reduce computational costs and delays
Addressing task-irrelevant information in long contexts that causes the lost-in-the-middle problem
Improving token retention for query-conditioned compression using information bottleneck theory
Innovation

Methods, ideas, or system contributions that make the work stand out.

Applies information bottleneck theory for context compression
Uses cross-attention to approximate mutual information
Achieves higher compression rates while maintaining performance
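As a rough illustration of cross-attention-based token importance scoring, the sketch below ranks context tokens by the attention mass they receive from the query and keeps the top fraction. This is not the paper's implementation: the aggregation rule (mean over query tokens), the fixed keep ratio, and all names are assumptions.

```python
import numpy as np

def compress_context(context_emb, query_emb, keep_ratio=0.5):
    """Rank context tokens by query-conditioned cross-attention; keep the top fraction.

    context_emb: (n, d) array of context token embeddings
    query_emb:   (m, d) array of query token embeddings
    keep_ratio:  fraction of context tokens to retain
    """
    d = context_emb.shape[1]
    # Scaled dot-product cross-attention: query tokens attend over context tokens.
    scores = query_emb @ context_emb.T / np.sqrt(d)           # (m, n)
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)             # softmax over context
    # Importance of each context token: attention mass averaged over query tokens.
    importance = weights.mean(axis=0)                         # (n,)
    k = max(1, int(len(importance) * keep_ratio))
    keep = np.sort(np.argsort(importance)[-k:])               # top-k, original order
    return keep, importance

rng = np.random.default_rng(0)
ctx = rng.normal(size=(12, 8))   # 12 context tokens, dim 8
qry = rng.normal(size=(3, 8))    # 3 query tokens
keep, imp = compress_context(ctx, qry, keep_ratio=0.25)
print(keep)  # indices of the retained context tokens, in document order
```

Returning indices in original order preserves the surface form of the compressed context; a dynamic truncation strategy as described in the summary would choose `k` from the importance distribution rather than a fixed ratio.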