Prompt Compression in the Wild: Measuring Latency, Rate Adherence, and Quality for Faster LLM Inference

📅 2026-04-03
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study addresses the high latency and computational overhead incurred by long prompts in large language models (LLMs), particularly in retrieval-augmented generation (RAG) scenarios, where the practical benefits of prompt compression remain unclear. We present the first large-scale empirical evaluation of end-to-end performance for prompt compression methods such as LLMLingua across diverse open-source LLMs and GPU hardware configurations. Our analysis comprehensively examines the trade-offs among compression overhead, decoding latency, output quality, and memory consumption. To facilitate this investigation, we introduce an open-source profiling tool capable of predicting the latency breakeven point for various model–hardware combinations. Experimental results demonstrate that, under suitable prompt lengths, compression ratios, and hardware settings, prompt compression can achieve up to 18% end-to-end speedup without significant degradation in output quality, while substantially reducing memory requirements—enabling migration from datacenter GPUs to consumer-grade hardware with only a 0.3-second latency penalty.
📝 Abstract
With the wide adoption of language models for IR -- and specifically RAG systems -- the latency of the underlying LLM becomes a crucial bottleneck, since the long contexts of retrieved passages lead to large prompts and, consequently, increased compute. Prompt compression, which reduces the size of input prompts while aiming to preserve performance on downstream tasks, has established itself as a cost-effective and low-latency method for accelerating inference in large language models. However, its usefulness depends on whether the additional preprocessing time before generation is offset by faster decoding. We present the first systematic, large-scale study of this trade-off, with thousands of runs and 30,000 queries across several open-source LLMs and three GPU classes. Our evaluation separates compression overhead from decoding latency while tracking output quality and memory usage. LLMLingua achieves up to 18% end-to-end speed-ups when prompt length, compression ratio, and hardware capacity are well matched, with response quality remaining statistically unchanged across summarization, code generation, and question answering tasks. Outside this operating window, however, the compression step dominates and cancels out the gains. We also show that effective compression can reduce memory usage enough to offload workloads from data center GPUs to commodity cards, with only a 0.3s increase in latency. Our open-source profiler predicts the latency break-even point for each model-hardware setup, providing practical guidance on when prompt compression delivers real-world benefits.
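The break-even logic described in the abstract can be sketched with a simple linear cost model: compression pays off only when the one-off compression overhead is smaller than the time saved in prefill and context-dependent decoding. The function names, parameters, and the linear cost model below are illustrative assumptions, not the paper's actual profiler.

```python
def e2e_latency(prompt_tokens: int, output_tokens: int,
                prefill_tps: float, decode_base_s: float,
                decode_ctx_s: float, compress_s: float = 0.0,
                ratio: float = 1.0) -> float:
    """End-to-end latency under an illustrative linear cost model.

    prefill_tps   -- prefill throughput (tokens/s)
    decode_base_s -- fixed per-output-token decode cost (s)
    decode_ctx_s  -- extra per-output-token cost per context token (s)
    compress_s    -- one-off prompt-compression overhead (s)
    ratio         -- fraction of prompt tokens kept after compression
    """
    kept = prompt_tokens * ratio
    prefill = kept / prefill_tps
    decode = output_tokens * (decode_base_s + decode_ctx_s * kept)
    return compress_s + prefill + decode


def breakeven_compress_s(prompt_tokens: int, output_tokens: int,
                         prefill_tps: float, decode_base_s: float,
                         decode_ctx_s: float, ratio: float) -> float:
    """Largest compression overhead for which compression still wins:
    the latency saved by running on the shorter prompt."""
    full = e2e_latency(prompt_tokens, output_tokens,
                       prefill_tps, decode_base_s, decode_ctx_s)
    compressed = e2e_latency(prompt_tokens, output_tokens,
                             prefill_tps, decode_base_s, decode_ctx_s,
                             ratio=ratio)
    return full - compressed
```

For example, with a 4,000-token prompt, 200 output tokens, and a 2x compression ratio, the model yields the maximum compression overhead that still leaves a net speed-up; if the compressor (e.g. LLMLingua) takes longer than that, compression loses, matching the paper's "operating window" observation.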
Problem

Research questions and friction points this paper is trying to address.

prompt compression
LLM inference
latency
RAG systems
large language models
Innovation

Methods, ideas, or system contributions that make the work stand out.

prompt compression
LLM inference acceleration
latency optimization
memory efficiency
hardware-aware profiling