🤖 AI Summary
Large language models (LLMs) face a fundamental bottleneck in processing long contexts due to fixed positional encoding limits and the quadratic computational complexity of self-attention. To address this, we propose QD-LCIRC—a fine-tuning-free, iterative context compression mechanism. QD-LCIRC integrates query-aware salient span selection with attention-masked recursive compression, enabling dynamic, adaptive context reduction via lightweight projection layers while preserving global semantics and enhancing query relevance. Evaluated on 32K-context multi-document QA and long-range reasoning tasks, QD-LCIRC improves accuracy by 12.7%, reduces GPU memory consumption by 41%, and incurs only an 8% increase in inference latency. To our knowledge, this is the first work to synergistically combine query-dependent compression with an iterative architectural design, achieving efficient, scalable, and training-free long-context enhancement for LLMs.
📝 Abstract
While large language models (LLMs) excel in generating coherent and contextually rich outputs, their capacity to efficiently handle long-form contexts is limited by fixed-length position embeddings. Additionally, the computational cost of processing long sequences increases quadratically, making it challenging to extend context length. To address these challenges, we propose Long-form Context Injection with Recurrent Compression (LCIRC), a method that enables the efficient processing long-form sequences beyond the model's length limit through recurrent compression without retraining the entire model. We further introduce query dependent context modeling, which selectively compresses query-relevant information, ensuring that the model retains the most pertinent content. Our empirical results demonstrate that Query Dependent LCIRC (QD-LCIRC) significantly improves LLM's ability to manage extended contexts, making it well-suited for tasks that require both comprehensive context understanding and query relevance.