🤖 AI Summary
This work addresses the limitations of existing context compression methods, which rely on fixed compression ratios and often suffer from unpredictable performance degradation, hindering practical deployment. To overcome this, the authors propose PoC, a performance-oriented adaptive compression framework that shifts the objective from achieving a predetermined compression ratio to ensuring model performance remains above an acceptable threshold. PoC incorporates a lightweight performance predictor—available in both context-agnostic and context-aware variants—to automatically determine the optimal compression ratio and guide off-the-shelf compressors accordingly. Experimental results on question answering and summarization tasks demonstrate that the context-aware predictor substantially reduces prediction error, enabling PoC to achieve higher compression efficiency while preserving model performance.
📝 Abstract
While context compression can mitigate the growing inference costs of Large Language Models (LLMs) by shortening contexts, existing methods that specify a target compression ratio or length suffer from unpredictable performance degradation, hindering their reliable deployment. We introduce a paradigm shift to Performance-oriented Context Compression (PoC), where developers specify an acceptable performance floor instead of a compression ratio. PoC employs a lightweight performance predictor to automatically find the most aggressive compression ratio that satisfies this constraint before steering an off-the-shelf compressor. We design and compare two predictor variants: a simple context-agnostic predictor and a more sophisticated context-aware one that accounts for the input's inherent compressibility. On both question-answering and summarization benchmarks, the context-aware predictor consistently achieves lower performance prediction error than the context-agnostic predictor, and the resulting context-aware PoC attains superior overall performance. Our work paves the way for more reliable, efficient, and performance-aware deployment of context compression for LLMs.
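The control loop the abstract describes can be sketched as follows. This is a minimal illustration, not the paper's implementation: `toy_predictor` and `toy_compressor` are hypothetical stand-ins for a trained performance predictor and an off-the-shelf compressor, and the candidate-ratio grid is an assumed design choice.

```python
def find_max_ratio(predict_perf, perf_floor, candidates):
    """Return the most aggressive compression ratio whose predicted
    performance stays at or above the acceptable floor."""
    feasible = [r for r in candidates if predict_perf(r) >= perf_floor]
    return max(feasible) if feasible else None

def poc_compress(context, predict_perf, compressor, perf_floor,
                 candidates=(0.0, 0.25, 0.5, 0.75, 0.9)):
    """Pick the compression ratio via the predictor, then steer the
    compressor with it; fall back to no compression if no candidate
    satisfies the performance floor."""
    ratio = find_max_ratio(predict_perf, perf_floor, candidates)
    if ratio is None:
        return context
    return compressor(context, ratio)

# Toy stand-ins for illustration only (not the paper's models).
def toy_predictor(ratio):
    # Assume performance degrades linearly with the fraction removed.
    return 1.0 - 0.6 * ratio

def toy_compressor(tokens, ratio):
    # Keep the first (1 - ratio) fraction of tokens.
    keep = max(1, int(len(tokens) * (1.0 - ratio)))
    return tokens[:keep]
```

For example, with a floor of 0.65 the toy predictor rules out ratios 0.75 and 0.9, so `poc_compress` applies the most aggressive remaining ratio, 0.5. A context-aware predictor would additionally take the context itself as input, so `predict_perf` would become a function of both the ratio and the input's compressibility.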