🤖 AI Summary
To address two critical bottlenecks in large language models (LLMs)—high prefill computation overhead and performance degradation due to "lost in the middle" effects in long-sequence processing—this paper proposes a natural-language-instruction-guided dynamic context compression framework. The framework introduces a novel language-modeling-head-driven token critic mechanism, integrated with bidirectional reasoning layers, multi-granularity semantic filtering, and window-parallel inference, enabling semantic-aware context compression, dynamic pruning, and efficient reweighting. Evaluated on benchmarks spanning 4K–2M tokens, it achieves an average compression ratio of 21.59× while improving task performance by 19.15 points on average. When deployed with Qwen2.5-32B, the method surpasses leading proprietary models on Ruler-128K and InfiniteBench, establishing new state-of-the-art results.
📝 Abstract
This technical report presents QwenLong-CPRS, a context compression framework designed for explicit long-context optimization, addressing prohibitive computation overhead during the prefill stage and the "lost in the middle" performance degradation of large language models (LLMs) during long-sequence processing. Implemented through a novel dynamic context optimization mechanism, QwenLong-CPRS enables multi-granularity context compression guided by natural language instructions, achieving both efficiency gains and improved performance. Evolved from the Qwen architecture series, QwenLong-CPRS introduces four key innovations: (1) natural-language-guided dynamic optimization; (2) bidirectional reasoning layers for enhanced boundary awareness; (3) token critic mechanisms with language modeling heads; and (4) window-parallel inference. Comprehensive evaluations across five benchmarks (4K–2M word contexts) demonstrate QwenLong-CPRS's threefold effectiveness: (1) consistent superiority over other context management methods, such as RAG and sparse attention, in both accuracy and efficiency; (2) architecture-agnostic integration with all flagship LLMs, including GPT-4o, Gemini2.0-pro, Claude3.7-sonnet, DeepSeek-v3, and Qwen2.5-max, achieving 21.59× context compression alongside 19.15-point average performance gains; (3) deployed with Qwen2.5-32B-Instruct, QwenLong-CPRS surpasses leading proprietary LLMs by 4.85 and 10.88 points on Ruler-128K and InfiniteBench respectively, establishing new SOTA performance.