FrugalPrompt: Reducing Contextual Overhead in Large Language Models via Token Attribution

📅 2025-10-18
📈 Citations: 0
Influential: 0
🤖 AI Summary
Problem: Large language models (LLMs) suffer from high inference cost, latency, and carbon footprint due to excessively long inputs, primarily caused by low-utility redundant tokens in prompts.
Method: We propose a token-importance–based prompt compression framework that uniquely integrates two state-of-the-art attribution methods—GlobEnc and DecompX—to systematically characterize task-specific sensitivity to context sparsity and quantify the trade-off between semantic reconstruction fidelity and sequential dependency preservation. Compression is achieved via significance scoring, Top-k% token selection, and order-preserving pruning.
Contribution/Results: Extensive evaluation across multiple tasks on frontier LLMs shows minimal performance degradation (<2% accuracy drop) under 20% compression for sentiment analysis, commonsense QA, and summarization; however, mathematical reasoning degrades significantly, revealing fundamental differences in context completeness requirements across tasks. Our work establishes an interpretable, generalizable theoretical foundation and practical methodology for efficient prompt engineering.
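The compression step described above (significance scoring → Top-k% selection → order-preserving pruning) can be sketched as follows. This is a minimal illustration, not the paper's implementation: in FrugalPrompt the per-token salience scores come from GlobEnc or DecompX attribution, whereas here they are hypothetical numbers supplied by hand.

```python
def frugalize(tokens, scores, keep_ratio=0.8):
    """Keep the top-k% highest-salience tokens, preserving original order.

    tokens: list of prompt tokens
    scores: one salience score per token (e.g. from an attribution method)
    keep_ratio: fraction of tokens to retain (0.8 = 20% compression)
    """
    k = max(1, int(len(tokens) * keep_ratio))
    # Rank positions by salience, take the k best, then re-sort by
    # position so the surviving tokens stay in their original order.
    ranked = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)
    kept = sorted(ranked[:k])
    return [tokens[i] for i in kept]


tokens = ["The", "movie", "was", "absolutely", "wonderful",
          "despite", "its", "slow", "start"]
scores = [0.02, 0.30, 0.05, 0.25, 0.40, 0.10, 0.03, 0.12, 0.08]  # hypothetical
print(" ".join(frugalize(tokens, scores, keep_ratio=0.6)))
# → movie absolutely wonderful despite slow
```

Order preservation matters because the pruned prompt is fed to the LLM as-is; re-sorting by position after selection keeps the remaining tokens grammatically coherent enough for the model to reconstruct the elided context.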

📝 Abstract
Large language models (LLMs) owe much of their stellar performance to expansive input contexts, yet such verbosity inflates monetary costs, carbon footprint, and inference-time latency. Much of this overhead stems from the redundant low-utility tokens present in typical prompts, as only a fraction of tokens typically carries the majority of the semantic weight. We address this inefficiency by introducing FrugalPrompt, a novel prompt compression framework for LLMs, which retains only the most semantically significant tokens. Leveraging two state-of-the-art token attribution methods, GlobEnc and DecompX, we assign salience scores to every token in an input sequence, rank them to preserve the top-k% tokens in their original order, and obtain a sparse frugalized prompt. We evaluate the approach across four NLP tasks: Sentiment Analysis, Commonsense QA, Summarization, and Mathematical Reasoning, using a suite of frontier LLMs. For the first three tasks, a 20% prompt reduction incurs only a marginal loss in task performance, demonstrating that contemporary LLMs can reconstruct elided context from high-salience cues. In contrast, performance on mathematical reasoning deteriorates sharply, reflecting a stronger dependence on complete token continuity. Further analysis with bottom-k% and random-k% tokens reveals asymmetric performance patterns that suggest potential task contamination effects, wherein models may resort to shallow memorized patterns from pretraining exposure for conventional NLP tasks. We posit that our work contributes to a more nuanced understanding of LLM behavior in performance-efficiency trade-offs, and delineates the boundary between tasks tolerant to contextual sparsity and those requiring exhaustive context. Our source code and models are available at: https://github.com/Starscream-11813/Frugal-ICL
Problem

Research questions and friction points this paper is trying to address.

Reducing redundant tokens in LLM prompts to lower costs and latency
Identifying high-salience tokens using GlobEnc and DecompX attribution methods
Evaluating performance-efficiency trade-offs across diverse NLP tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

FrugalPrompt compresses prompts by retaining top semantic tokens
It uses GlobEnc and DecompX token attribution methods
Reduces prompts by 20% with minimal performance loss
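The abstract's ablations compare three selection regimes: top-k% (highest salience), bottom-k% (lowest salience), and random-k%. A minimal sketch of the three strategies is below; the `select_indices` helper and its interface are illustrative assumptions, not the paper's code, and the salience scores would in practice come from GlobEnc or DecompX.

```python
import random


def select_indices(scores, keep_ratio, strategy="top", seed=0):
    """Return the token positions kept under one of three selection regimes.

    strategy: "top" (highest salience), "bottom" (lowest salience, ablation),
              or "random" (uniform baseline, seeded for reproducibility).
    The result is sorted so that selection is always order-preserving.
    """
    n = len(scores)
    k = max(1, int(n * keep_ratio))
    if strategy == "top":
        chosen = sorted(range(n), key=lambda i: scores[i], reverse=True)[:k]
    elif strategy == "bottom":
        chosen = sorted(range(n), key=lambda i: scores[i])[:k]
    else:
        chosen = random.Random(seed).sample(range(n), k)
    return sorted(chosen)
```

Running a task with all three regimes at the same keep ratio is what exposes the asymmetry noted in the abstract: if bottom-k% prompts score far above chance, the model is likely leaning on memorized task patterns rather than the retained context.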