Efficient Prompt Compression with Evaluator Heads for Long-Context Transformer Inference

📅 2025-01-22
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the high computational overhead and low inference efficiency of large language models (LLMs) during the prefill phase on long-context inputs, this paper proposes EHPC, a training-free prompt compression method. EHPC identifies and leverages naturally occurring "evaluator heads" in pretrained Transformers, combining attention analysis, inter-layer early exit, and dynamic token-importance scoring to select key tokens without fine-tuning or cache reconstruction. Its central contribution is the first discovery and systematic use of these evaluator heads for efficient, parameter-free prompt compression. EHPC significantly reduces prefill complexity and achieves state-of-the-art performance on both prompt compression and long-context inference acceleration benchmarks. It substantially lowers API-level computational cost while delivering acceleration comparable to key-value-cache-based approaches.
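The summary above describes scoring tokens by the attention they receive from a few designated evaluator heads in the model's early layers, then passing only the top-scoring tokens on for full inference. A minimal sketch of that selection step is shown below; the function name, head indices, and keep ratio are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of evaluator-head-based token selection.
# Assumes we already have the attention tensor from one early layer;
# tokens are scored by the attention they receive from a chosen set
# of "evaluator" heads, and only the top-k positions are retained.
import numpy as np

def compress_prompt(attn, evaluator_heads, keep_ratio=0.5):
    """Return the indices of the most important token positions.

    attn: array of shape (num_heads, seq_len, seq_len), where
          attn[h, i, j] is head h's attention from query i to key j.
    evaluator_heads: indices of the heads used to score tokens.
    keep_ratio: fraction of tokens to keep.
    """
    # Average, over the evaluator heads and all query positions, the
    # attention each key token receives -> one score per token.
    scores = attn[evaluator_heads].mean(axis=(0, 1))  # (seq_len,)
    k = max(1, int(len(scores) * keep_ratio))
    # Keep the k highest-scoring positions, restored to original order.
    kept = np.sort(np.argsort(scores)[-k:])
    return kept

# Toy example: 4 heads, 8 tokens, rows normalized like attention weights.
rng = np.random.default_rng(0)
raw = rng.random((4, 8, 8))
attn = raw / raw.sum(axis=-1, keepdims=True)
kept = compress_prompt(attn, evaluator_heads=[0, 2], keep_ratio=0.5)
print(kept)  # indices of the 4 retained tokens
```

In the full method this skim would stop after the first few layers (early exit), so the cost of scoring is a small fraction of a full prefill pass.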

📝 Abstract
Although applications involving long-context inputs are crucial for the effective utilization of large language models (LLMs), they also result in increased computational costs and reduced performance. To address this challenge, we propose an efficient, training-free prompt compression method that retains key information within compressed prompts. We identify specific attention heads in transformer-based LLMs, which we designate as evaluator heads, that are capable of selecting tokens in long inputs that are most significant for inference. Building on this discovery, we develop EHPC, an Evaluator Head-based Prompt Compression method, which enables LLMs to rapidly "skim through" input prompts by leveraging only the first few layers with evaluator heads during the pre-filling stage, subsequently passing only the important tokens to the model for inference. EHPC achieves state-of-the-art results across two mainstream benchmarks: prompt compression and long-context inference acceleration. Consequently, it effectively reduces the complexity and costs associated with commercial API calls. We further demonstrate that EHPC attains competitive results compared to key-value cache-based acceleration methods, thereby highlighting its potential to enhance the efficiency of LLMs for long-context tasks.
Problem

Research questions and friction points this paper is trying to address.

Long Sentence Processing
Computational Resources
Model Efficiency
Innovation

Methods, ideas, or system contributions that make the work stand out.

EHPC
Transformer Model
Long Text Compression
🔎 Similar Papers
No similar papers found.
Weizhi Fei
Department of Mathematical Sciences, Tsinghua University
Knowledge Graph · Natural Language Processing
Xueyan Niu
Theory Lab, 2012 Labs, Huawei Technologies Co., Ltd.
Information Theory · Machine Learning · Communication
Guoqing Xie
Architecture & Design, ICT Products & Solutions, Huawei Technologies Co., Ltd.
Yingqing Liu
Architecture & Design, ICT Products & Solutions, Huawei Technologies Co., Ltd.
Bo Bai
Theory Lab, 2012 Labs, Huawei Technologies Co., Ltd.
Wei Han
Theory Lab, 2012 Labs, Huawei Technologies Co., Ltd.