🤖 AI Summary
To address two critical bottlenecks in large language models (LLMs)—high prefill computation overhead and performance degradation due to "lost in the middle" effects in long-sequence processing—this paper proposes a natural-language-instruction-guided dynamic context compression framework. The framework introduces a novel language-modeling-head-driven token critic mechanism, integrated with bidirectional reasoning layers, multi-granularity semantic filtering, and window-parallel inference, enabling semantic-aware context compression, dynamic pruning, and efficient reweighting. Evaluated on benchmarks spanning 4K–2M tokens, it achieves an average compression ratio of 21.59× while improving task performance by 19.15 points on average. When deployed with Qwen2.5-32B, the method surpasses leading proprietary models on Ruler-128K and InfiniteBench, establishing new state-of-the-art results.
📝 Abstract
This technical report presents QwenLong-CPRS, a context compression framework designed for explicit long-context optimization, addressing prohibitive computation overhead during the prefill stage and the "lost in the middle" performance degradation of large language models (LLMs) during long-sequence processing. Implemented through a novel dynamic context optimization mechanism, QwenLong-CPRS enables multi-granularity context compression guided by natural language instructions, achieving both efficiency gains and improved performance. Evolved from the Qwen architecture series, QwenLong-CPRS introduces four key innovations: (1) natural-language-guided dynamic optimization; (2) bidirectional reasoning layers for enhanced boundary awareness; (3) token critic mechanisms with language modeling heads; and (4) window-parallel inference. Comprehensive evaluations across five benchmarks (4K–2M word contexts) demonstrate QwenLong-CPRS's threefold effectiveness: (1) consistent superiority over other context management methods, such as RAG and sparse attention, in both accuracy and efficiency; (2) architecture-agnostic integration with all flagship LLMs, including GPT-4o, Gemini2.0-pro, Claude3.7-sonnet, DeepSeek-v3, and Qwen2.5-max, achieving 21.59× context compression alongside 19.15-point average performance gains; (3) deployed with Qwen2.5-32B-Instruct, QwenLong-CPRS surpasses leading proprietary LLMs by 4.85 and 10.88 points on Ruler-128K and InfiniteBench respectively, establishing new SOTA performance.