🤖 AI Summary
This work addresses the limitations of existing context compression methods, which rely on fixed compression ratios and often suffer from unpredictable performance degradation, hindering practical deployment. To overcome this, the authors propose PoC, a performance-oriented adaptive compression framework that shifts the objective from achieving a predetermined compression ratio to ensuring model performance remains above an acceptable threshold. PoC incorporates a lightweight performance predictor—available in both context-agnostic and context-aware variants—to automatically determine the optimal compression ratio and guide off-the-shelf compressors accordingly. Experimental results on question answering and summarization tasks demonstrate that the context-aware predictor substantially reduces prediction error, enabling PoC to achieve higher compression efficiency while preserving model performance.
📝 Abstract
While context compression can mitigate the growing inference costs of Large Language Models (LLMs) by shortening contexts, existing methods that specify a target compression ratio or length suffer from unpredictable performance degradation, hindering their reliable deployment. We introduce a paradigm shift to Performance-oriented Context Compression (PoC), where developers specify an acceptable performance floor instead of a compression ratio. PoC employs a lightweight performance predictor to automatically find the most aggressive compression ratio that satisfies this constraint before steering an off-the-shelf compressor. We design and compare two predictor variants: a simple context-agnostic predictor and a more sophisticated context-aware one that accounts for the input's inherent compressibility. On both question-answering and summarization benchmarks, the context-aware predictor consistently achieves lower performance prediction error than the context-agnostic predictor, and the resulting context-aware PoC attains superior overall performance. Our work paves the way for more reliable, efficient, and performance-aware deployment of context compression for LLMs.
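The control loop the abstract describes can be sketched as follows. This is a minimal illustration, not the paper's implementation: `toy_predictor` and `toy_compressor` are hypothetical stand-ins for a trained performance predictor and an off-the-shelf compressor, and the candidate-ratio grid is an assumed design choice.

```python
def find_max_ratio(predict_perf, perf_floor, candidates):
    """Return the most aggressive compression ratio whose predicted
    performance stays at or above the acceptable floor."""
    feasible = [r for r in candidates if predict_perf(r) >= perf_floor]
    return max(feasible) if feasible else None

def poc_compress(context, predict_perf, compressor, perf_floor,
                 candidates=(0.0, 0.25, 0.5, 0.75, 0.9)):
    """Pick the compression ratio via the predictor, then steer the
    compressor with it; fall back to no compression if no candidate
    satisfies the performance floor."""
    ratio = find_max_ratio(predict_perf, perf_floor, candidates)
    if ratio is None:
        return context
    return compressor(context, ratio)

# Toy stand-ins for illustration only (not the paper's models).
def toy_predictor(ratio):
    # Assume performance degrades linearly with the fraction removed.
    return 1.0 - 0.6 * ratio

def toy_compressor(tokens, ratio):
    # Keep the first (1 - ratio) fraction of tokens.
    keep = max(1, int(len(tokens) * (1.0 - ratio)))
    return tokens[:keep]
```

For example, with a floor of 0.65 the toy predictor rules out ratios 0.75 and 0.9, so `poc_compress` applies the most aggressive remaining ratio, 0.5. A context-aware predictor would additionally take the context itself as input, so `predict_perf` would become a function of both the ratio and the input's compressibility.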