AI Summary
This work proposes a query-aware context compression method to reduce the inference cost of large language models in retrieval-augmented generation (RAG). It introduces leave-one-out evaluation into context pruning for the first time, employing a lightweight encoder-only Transformer to assess each sentence's evidential contribution toward answering the query. The model is trained with a composite margin-ranking loss to effectively distinguish critical from non-critical content. Experimental results show that the proposed approach achieves significantly higher compression ratios, improved inference throughput, and reduced memory consumption while maintaining high Exact-Match and F1 scores across multiple question-answering benchmarks, enabling efficient yet accurate context compression.
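The leave-one-out idea can be sketched with a toy example: each sentence's contribution is the drop in a clue score when that sentence is removed from the context. Here the term-overlap `clue_score` is only a stand-in for the paper's learned encoder scorer, and all names are illustrative:

```python
def clue_score(sentences, query_terms):
    # Stand-in for a learned clue-richness scorer: counts query-term occurrences.
    text = " ".join(sentences).lower()
    return sum(text.count(t) for t in query_terms)

def leave_one_out_contributions(sentences, query_terms):
    # Contribution of sentence i = score(full context) - score(context without i).
    full = clue_score(sentences, query_terms)
    contribs = []
    for i in range(len(sentences)):
        without = sentences[:i] + sentences[i + 1:]
        contribs.append(full - clue_score(without, query_terms))
    return contribs

sents = [
    "Paris is the capital of France.",
    "The Eiffel Tower is in Paris.",
    "Bananas are yellow.",
]
print(leave_one_out_contributions(sents, ["paris", "capital"]))  # → [2, 1, 0]
```

Sentences whose removal leaves the score unchanged (like the last one) are candidates for pruning.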
Abstract
Efficient context compression is crucial for improving the accuracy and scalability of question answering. In Retrieval-Augmented Generation (RAG), retrieved context should be delivered quickly and kept compact and precise, preserving sufficient clues while keeping the LLM reader's cost within budget. We propose a margin-based framework for query-driven context pruning, which identifies sentences critical for answering a query by measuring the change in clue richness when each is omitted. The model is trained with a composite ranking loss that enforces large margins for critical sentences while keeping non-critical ones near neutral. Built on a lightweight encoder-only Transformer, our approach generally achieves strong exact-match and F1 scores with high-throughput inference and lower memory requirements than major baselines. Beyond efficiency, our method yields effective compression ratios without degrading answering performance, demonstrating its potential as a lightweight and practical alternative for retrieval-augmented tasks.
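A composite objective of this kind can be sketched as a pairwise hinge term that pushes critical sentences' scores above non-critical ones by a margin, plus a penalty holding non-critical scores near zero. The weights and exact form below are illustrative assumptions, not the paper's formulation:

```python
def composite_margin_loss(critical_scores, noncritical_scores,
                          margin=1.0, neutral_weight=0.1):
    # Pairwise hinge: each critical score should exceed each non-critical
    # score by at least `margin`.
    ranking = 0.0
    for c in critical_scores:
        for n in noncritical_scores:
            ranking += max(0.0, margin - (c - n))
    ranking /= len(critical_scores) * len(noncritical_scores)
    # Neutrality penalty: keep non-critical scores close to zero.
    neutral = sum(n * n for n in noncritical_scores) / len(noncritical_scores)
    return ranking + neutral_weight * neutral
```

With a clear margin already satisfied, e.g. `composite_margin_loss([2.0], [0.0])`, the loss is zero; when scores are close, e.g. `composite_margin_loss([0.5], [0.3])`, both the hinge and the neutrality term contribute.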