🤖 AI Summary
Large language models (LLMs) incur high computational overhead and deployment costs when processing long prompts. To address this, the paper systematically evaluates six prompt compression methods across text and multimodal tasks, showing that prompt length and cost can be reduced while output quality is preserved. Methodologically, it presents the first comprehensive empirical analysis of compression's impact on hallucination, token omission, and long-context performance, revealing that moderate compression can even improve LLM performance on Longbench long-context benchmarks (up to +2.3%). The evaluation framework spans 13 heterogeneous datasets, covering news, scientific, commonsense reasoning, mathematical, QA, and VQA domains, and incorporates multidimensional metrics: generation quality, hallucination rate, and cross-modal robustness. All code and datasets are publicly released to facilitate reproducibility and community-driven extensions.
📝 Abstract
Prompt engineering enables Large Language Models (LLMs) to perform a variety of tasks. However, lengthy prompts significantly increase computational complexity and economic costs. To address this issue, we study six prompt compression methods for LLMs, aiming to reduce prompt length while maintaining LLM response quality. In this paper, we present a comprehensive analysis covering aspects such as generation performance, model hallucinations, efficacy in multimodal tasks, word omission analysis, and more. We evaluate these methods across 13 datasets, including news, scientific articles, commonsense QA, math QA, long-context QA, and VQA datasets. Our experiments reveal that prompt compression has a greater impact on LLM performance in long contexts than in short ones. In the Longbench evaluation, moderate compression even enhances LLM performance. Our code and data are available at https://github.com/3DAgentWorld/Toolkit-for-Prompt-Compression.
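To illustrate the general idea of prompt compression (this is a hedged, minimal sketch, not one of the six methods evaluated in the paper or the API of the released toolkit): an extractive compressor can score sentences by average word rarity and keep the most informative ones until a token budget is reached.

```python
# Illustrative extractive prompt compression: keep the highest-information
# sentences under a word budget. The scoring heuristic (inverse word
# frequency) is an assumption for this sketch, not the paper's method.
import re
from collections import Counter

def compress_prompt(prompt: str, keep_ratio: float = 0.5) -> str:
    # Split into sentences on terminal punctuation followed by whitespace.
    sentences = re.split(r'(?<=[.!?])\s+', prompt.strip())
    words = re.findall(r'\w+', prompt.lower())
    freq = Counter(words)
    budget = int(len(words) * keep_ratio)

    def score(sent: str) -> float:
        toks = re.findall(r'\w+', sent.lower())
        if not toks:
            return 0.0
        # Rarer words carry more information, so weight by inverse frequency.
        return sum(1.0 / freq[t] for t in toks) / len(toks)

    # Greedily keep the highest-scoring sentences within the budget.
    ranked = sorted(range(len(sentences)),
                    key=lambda i: score(sentences[i]), reverse=True)
    kept, used = set(), 0
    for i in ranked:
        n = len(re.findall(r'\w+', sentences[i]))
        if used + n <= budget or not kept:
            kept.add(i)
            used += n
    # Restore original sentence order to preserve coherence.
    return ' '.join(sentences[i] for i in sorted(kept))
```

Real methods studied in this line of work are considerably more sophisticated (e.g., using a small LM to estimate per-token information), but they share this structure: rank prompt units by information content and drop the rest.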