QwenLong-CPRS: Towards ∞-LLMs with Dynamic Context Optimization

📅 2025-05-23
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
To address two critical bottlenecks in large language models (LLMs), namely high prefill computation overhead and the "lost in the middle" performance degradation in long-sequence processing, this paper proposes a dynamic context compression framework guided by natural-language instructions. The framework introduces a novel language-modeling-head-driven token critic mechanism, combined with bidirectional reasoning layers, multi-granularity semantic filtering, and window-parallel inference, enabling semantic-aware context compression, dynamic pruning, and efficient reweighting. Evaluated on benchmarks spanning 4K–2M tokens, it achieves an average compression ratio of 21.59× while improving task performance by 19.15 points on average. Deployed with Qwen2.5-32B, the method surpasses leading proprietary models on Ruler-128K and InfiniteBench, establishing new state-of-the-art results.

📝 Abstract
This technical report presents QwenLong-CPRS, a context compression framework designed for explicit long-context optimization, addressing prohibitive computation overhead during the prefill stage and the "lost in the middle" performance degradation of large language models (LLMs) during long-sequence processing. Implemented through a novel dynamic context optimization mechanism, QwenLong-CPRS enables multi-granularity context compression guided by natural language instructions, achieving both efficiency gains and improved performance. Evolved from the Qwen architecture series, QwenLong-CPRS introduces four key innovations: (1) natural language-guided dynamic optimization; (2) bidirectional reasoning layers for enhanced boundary awareness; (3) token critic mechanisms with language modeling heads; and (4) window-parallel inference. Comprehensive evaluations across five benchmarks (4K–2M word contexts) demonstrate QwenLong-CPRS's threefold effectiveness: (1) consistent superiority over other context management methods such as RAG and sparse attention in both accuracy and efficiency; (2) architecture-agnostic integration with all flagship LLMs, including GPT-4o, Gemini2.0-pro, Claude3.7-sonnet, DeepSeek-v3, and Qwen2.5-max, achieving 21.59× context compression alongside 19.15-point average performance gains; and (3) deployed with Qwen2.5-32B-Instruct, QwenLong-CPRS surpasses leading proprietary LLMs by 4.85 and 10.88 points on Ruler-128K and InfiniteBench, establishing new SOTA performance.
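As a rough illustration of the token-critic idea described above (a minimal sketch, not the paper's actual implementation), dynamic pruning can be thought of as keeping only the highest-scoring tokens while preserving their original order. Here the per-token relevance scores that a language-modeling-head critic would produce are supplied directly as a hypothetical input:

```python
def compress_context(tokens, scores, keep_ratio=0.25):
    """Keep the highest-scoring tokens, preserving original order.

    `scores` stands in for the per-token relevance a token-critic
    head would produce; here they are supplied directly (assumed).
    """
    k = max(1, int(len(tokens) * keep_ratio))
    # Indices of the k highest-scoring tokens.
    top = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)[:k]
    keep = set(top)
    # Emit survivors in their original positions.
    return [t for i, t in enumerate(tokens) if i in keep]

tokens = ["The", "report", "was", "filed", "on", "March", "3", "by", "Lee"]
scores = [0.1, 0.6, 0.1, 0.5, 0.2, 0.9, 0.9, 0.2, 0.8]
print(compress_context(tokens, scores, keep_ratio=0.5))
# → ['report', 'March', '3', 'Lee']
```

The real system conditions these scores on the natural-language instruction and supports multiple granularities (token, sentence, paragraph); this sketch only shows the order-preserving top-k selection step.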
Problem

Research questions and friction points this paper is trying to address.

Reduces computation overhead in LLM prefill stage
Addresses performance drop in long sequence processing
Enables efficient multi-granularity context compression
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dynamic context optimization for efficiency
Natural language-guided multi-granularity compression
Window-parallel inference architecture enhancement
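The window-parallel idea in the last bullet can be sketched as scoring fixed-size windows independently, so each window's work can run in parallel, then concatenating the tokens kept from each window. This is a hedged toy sketch: `score_fn` is a hypothetical stand-in for the model's per-token critic, and the window/keep sizes are illustrative only:

```python
def window_parallel_compress(tokens, score_fn, window=4, keep_per_window=2):
    """Score each fixed-size window independently (that independence is
    what makes the windows parallelizable), then concatenate the tokens
    kept from each window in original order."""
    kept = []
    for start in range(0, len(tokens), window):
        chunk = tokens[start:start + window]
        scores = score_fn(chunk)  # stand-in for the token-critic head
        k = min(keep_per_window, len(chunk))
        top = sorted(range(len(chunk)), key=lambda i: scores[i], reverse=True)[:k]
        kept.extend(chunk[i] for i in sorted(top))
    return kept

# Toy scorer: longer tokens are treated as more relevant (assumption).
result = window_parallel_compress(
    ["a", "bb", "ccc", "d", "ee", "fff", "g"],
    lambda chunk: [len(t) for t in chunk],
    window=4,
    keep_per_window=2,
)
print(result)  # → ['bb', 'ccc', 'ee', 'fff']
```

Because each loop iteration touches only its own window, the iterations could be dispatched concurrently in a real system; the sequential loop here is just for clarity.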
Weizhou Shen
Tongyi Lab, Alibaba Group
Chenliang Li
Qwen-Doc Team, Alibaba Group
Fanqi Wan
Sun Yat-sen University
Shengyi Liao
Qwen-Doc Team, Alibaba Group
Shaopeng Lai
Qwen-Doc Team, Alibaba Group
Bo Zhang
Qwen-Doc Team, Alibaba Group
Yingcheng Shi
Qwen-Doc Team, Alibaba Group
Yuning Wu
Wayne State University
Gang Fu
Amazon
Zhansheng Li
Qwen-Doc Team, Alibaba Group
Bin Yang
Qwen-Doc Team, Alibaba Group
Ji Zhang
Qwen-Doc Team, Alibaba Group
Fei Huang
Qwen-Doc Team, Alibaba Group
Jingren Zhou
Alibaba Group, Microsoft
Ming Yan
Qwen-Doc Team, Alibaba Group