LooComp: Leverage Leave-One-Out Strategy to Encoder-only Transformer for Efficient Query-aware Context Compression

πŸ“… 2026-03-10
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ€– AI Summary
This work proposes a query-aware context compression method to reduce the inference cost of large language models in retrieval-augmented generation (RAG). It introduces leave-one-out evaluation into context pruning for the first time, employing a lightweight encoder-only Transformer to assess each sentence’s evidential contribution toward answering the query. The model is trained with a composite margin-ranking loss to effectively distinguish between critical and non-critical content. Experimental results demonstrate that the proposed approach achieves significantly higher compression ratios, improved inference throughput, and reduced memory consumption while maintaining high Exact-Match and F1 scores across multiple question-answering benchmarks, thereby enabling efficient yet accurate context compression.

πŸ“ Abstract
Efficient context compression is crucial for improving the accuracy and scalability of question answering. For efficient Retrieval-Augmented Generation, context must be delivered quickly, compactly, and precisely, ensuring clue sufficiency at a budget-friendly LLM reader cost. We propose a margin-based framework for query-driven context pruning that identifies sentences critical for answering a query by measuring the change in clue richness when each is omitted. The model is trained with a composite ranking loss that enforces large margins for critical sentences while keeping non-critical ones near neutral. Built on a lightweight encoder-only Transformer, our approach achieves strong exact-match and F1 scores with high-throughput inference and lower memory requirements than major baselines. Beyond efficiency, our method yields high compression ratios without degrading answering performance, demonstrating its potential as a lightweight, practical alternative for retrieval-augmented tasks.
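The leave-one-out idea described in the abstract can be sketched as follows. This is a hypothetical illustration, not the paper's implementation: `clue_richness` is a toy query-term-coverage proxy standing in for the encoder's learned evidence measure, and `margin_ranking_loss` is one plausible reading of a composite margin-ranking objective (hinge margin between critical and non-critical sentences, plus a term keeping non-critical scores near zero).

```python
def clue_richness(sentences, query_terms):
    """Toy proxy for 'clue richness': fraction of query terms covered
    by the concatenated context (illustrative stand-in for a model)."""
    text = " ".join(sentences).lower()
    covered = sum(1 for t in query_terms if t.lower() in text)
    return covered / max(len(query_terms), 1)

def loo_scores(sentences, query_terms):
    """Leave-one-out: score each sentence by the drop in clue richness
    when that sentence is omitted from the context."""
    full = clue_richness(sentences, query_terms)
    scores = []
    for i in range(len(sentences)):
        rest = sentences[:i] + sentences[i + 1:]
        scores.append(full - clue_richness(rest, query_terms))
    return scores

def margin_ranking_loss(critical, non_critical, margin=0.5):
    """Composite loss sketch: hinge term pushes critical scores above
    non-critical ones by at least `margin`; quadratic term keeps
    non-critical scores near neutral (zero)."""
    pairs = [(c, n) for c in critical for n in non_critical]
    rank = sum(max(0.0, margin - (c - n)) for c, n in pairs)
    neutral = sum(n * n for n in non_critical)
    return rank / max(len(pairs), 1) + neutral

# Example: the sentence carrying the query terms gets a high LOO score.
sents = ["Paris is the capital of France.", "Bananas are yellow."]
scores = loo_scores(sents, ["Paris", "capital"])  # [1.0, 0.0]
```

In this toy example, omitting the first sentence removes all query-term coverage, so its leave-one-out score is maximal, while the second sentence contributes nothing and scores zero; a trained scorer would be supervised to reproduce this ordering with a margin.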
Problem

Research questions and friction points this paper is trying to address.

context compression
question answering
Retrieval Augmented Generation
efficiency
query-aware
Innovation

Methods, ideas, or system contributions that make the work stand out.

Leave-One-Out
Encoder-only Transformer
Query-aware Context Compression
Margin-based Pruning
Retrieval Augmented Generation
Thao Do
Korea Advanced Institute of Science and Technology (KAIST), Daejeon 34141, South Korea
Dinh Phu Tran
Korea Advanced Institute of Science and Technology (KAIST), Daejeon 34141, South Korea
An Vo
Korea Advanced Institute of Science and Technology (KAIST)
Machine Learning, Computer Vision, NLP, Evolutionary Computation
Seon Kwon Kim
Korea Advanced Institute of Science and Technology (KAIST), Daejeon 34141, South Korea
Daeyoung Kim
Professor of School of Computing, KAIST
Cloud Computing, Internet of Things, Machine Learning