AI Summary
This work proposes a query-aware context compression method to reduce the inference cost of large language models in retrieval-augmented generation (RAG). It introduces leave-one-out evaluation into context pruning for the first time, employing a lightweight encoder-only Transformer to assess each sentence's evidential contribution toward answering the query. The model is trained with a composite margin-ranking loss to effectively distinguish critical from non-critical content. Experimental results show that the proposed approach achieves significantly higher compression ratios, improved inference throughput, and reduced memory consumption while maintaining high Exact-Match and F1 scores across multiple question-answering benchmarks, enabling efficient yet accurate context compression.
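The leave-one-out idea can be sketched with a toy example: each sentence's contribution is the drop in a clue score when that sentence is removed from the context. Here the term-overlap `clue_score` is only a stand-in for the paper's learned encoder scorer, and all names are illustrative:

```python
def clue_score(sentences, query_terms):
    # Stand-in for a learned clue-richness scorer: counts query-term occurrences.
    text = " ".join(sentences).lower()
    return sum(text.count(t) for t in query_terms)

def leave_one_out_contributions(sentences, query_terms):
    # Contribution of sentence i = score(full context) - score(context without i).
    full = clue_score(sentences, query_terms)
    contribs = []
    for i in range(len(sentences)):
        without = sentences[:i] + sentences[i + 1:]
        contribs.append(full - clue_score(without, query_terms))
    return contribs

sents = [
    "Paris is the capital of France.",
    "The Eiffel Tower is in Paris.",
    "Bananas are yellow.",
]
print(leave_one_out_contributions(sents, ["paris", "capital"]))  # → [2, 1, 0]
```

Sentences whose removal leaves the score unchanged (like the last one) are candidates for pruning.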
Abstract
Efficient context compression is crucial for improving the accuracy and scalability of question answering. In Retrieval-Augmented Generation (RAG), retrieved context should be delivered quickly and kept compact and precise, preserving sufficient clues while keeping the LLM reader's cost within budget. We propose a margin-based framework for query-driven context pruning, which identifies sentences critical for answering a query by measuring the change in clue richness when each is omitted. The model is trained with a composite ranking loss that enforces large margins for critical sentences while keeping non-critical ones near neutral. Built on a lightweight encoder-only Transformer, our approach generally achieves strong exact-match and F1 scores with high-throughput inference and lower memory requirements than major baselines. Beyond efficiency, our method yields effective compression ratios without degrading answering performance, demonstrating its potential as a lightweight and practical alternative for retrieval-augmented tasks.
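A composite objective of this kind can be sketched as a pairwise hinge term that pushes critical sentences' scores above non-critical ones by a margin, plus a penalty holding non-critical scores near zero. The weights and exact form below are illustrative assumptions, not the paper's formulation:

```python
def composite_margin_loss(critical_scores, noncritical_scores,
                          margin=1.0, neutral_weight=0.1):
    # Pairwise hinge: each critical score should exceed each non-critical
    # score by at least `margin`.
    ranking = 0.0
    for c in critical_scores:
        for n in noncritical_scores:
            ranking += max(0.0, margin - (c - n))
    ranking /= len(critical_scores) * len(noncritical_scores)
    # Neutrality penalty: keep non-critical scores close to zero.
    neutral = sum(n * n for n in noncritical_scores) / len(noncritical_scores)
    return ranking + neutral_weight * neutral
```

With a clear margin already satisfied, e.g. `composite_margin_loss([2.0], [0.0])`, the loss is zero; when scores are close, e.g. `composite_margin_loss([0.5], [0.3])`, both the hinge and the neutrality term contribute.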