Sentinel: Attention Probing of Proxy Models for LLM Context Compression with an Understanding Perspective

📅 2025-05-29
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
In retrieval-augmented generation (RAG), retrieved passages are often lengthy and noisy, frequently exceeding large language model (LLM) input limits; existing compression methods rely on training task-specific models, incurring high computational cost and poor portability. This paper proposes a training-free, lightweight sentence-level compression framework that formulates context filtering as a query-aware attention comprehension task. It leverages the native decoder self-attention signals of a 0.5B-parameter proxy LLM as an unsupervised relevance probe, combined with a lightweight classifier that scores sentence importance, enabling zero-shot cross-model transfer. Evaluated on LongBench, the method achieves up to 5× compression while matching the QA performance of a 7B supervised compression system, significantly reducing computational overhead and deployment complexity.

📝 Abstract
Retrieval-augmented generation (RAG) enhances large language models (LLMs) with external context, but retrieved passages are often lengthy, noisy, or exceed input limits. Existing compression methods typically require supervised training of dedicated compression models, increasing cost and reducing portability. We propose Sentinel, a lightweight sentence-level compression framework that reframes context filtering as an attention-based understanding task. Rather than training a compression model, Sentinel probes decoder attention from an off-the-shelf 0.5B proxy LLM using a lightweight classifier to identify sentence relevance. Empirically, we find that query-context relevance estimation is consistent across model scales, with 0.5B proxies closely matching the behaviors of larger models. On the LongBench benchmark, Sentinel achieves up to 5× compression while matching the QA performance of 7B-scale compression systems. Our results suggest that probing native attention signals enables fast, effective, and question-aware context compression. Code available at: https://github.com/yzhangchuck/Sentinel.
Problem

Research questions and friction points this paper is trying to address.

Compressing lengthy, noisy retrieved passages to fit LLM input limits
Reducing the training cost and poor portability of supervised compression models
Identifying sentence relevance via attention probing for efficient compression
Innovation

Methods, ideas, or system contributions that make the work stand out.

Lightweight sentence-level compression framework
Probes decoder attention with proxy LLM
Uses classifier for sentence relevance identification
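The pipeline sketched in the bullets above (probe decoder attention, score sentences, keep the most relevant) can be illustrated with a minimal toy example. Note the assumptions: the paper's actual system extracts attention from a 0.5B proxy LLM and applies a trained classifier (see the linked repository); here the attention matrix, token spans, and the simple mean-aggregation scoring are synthetic placeholders standing in for those components.

```python
# Simplified sketch of attention-based sentence scoring for context compression.
# The attention matrix below is hand-made; in the real system it would come from
# a proxy LLM's decoder self-attention, and a trained classifier (not a plain
# mean) would turn attention features into relevance scores.
import numpy as np

def score_sentences(attention, query_span, sentence_spans):
    """Average attention mass flowing from query tokens to each sentence's tokens.

    attention: (seq_len, seq_len) array; attention[i, j] = weight token i pays to token j.
    query_span: (start, end) token indices of the query.
    sentence_spans: list of (start, end) token index pairs, one per context sentence.
    """
    q0, q1 = query_span
    return [float(attention[q0:q1, s0:s1].mean()) for s0, s1 in sentence_spans]

def compress(sentences, scores, keep=2):
    """Keep the `keep` highest-scoring sentences, preserving original order."""
    top = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)[:keep]
    return [sentences[i] for i in sorted(top)]

# Toy demo: 3 context sentences + a query, token spans assigned by hand.
sentences = ["Paris is the capital of France.",
             "The Eiffel Tower opened in 1889.",
             "Bananas are rich in potassium."]
# tokens 0-5 -> sentence 0, 6-11 -> sentence 1, 12-16 -> sentence 2, 17-20 -> query
attention = np.full((21, 21), 0.01)
attention[17:21, 0:6] = 0.30   # query attends strongly to sentence 0
attention[17:21, 6:12] = 0.05
attention[17:21, 12:17] = 0.02

scores = score_sentences(attention, (17, 21), [(0, 6), (6, 12), (12, 17)])
print(compress(sentences, scores, keep=2))
# -> ['Paris is the capital of France.', 'The Eiffel Tower opened in 1889.']
```

The key design point carried over from the paper is that relevance is read off the proxy model's attention rather than learned by a dedicated compression model, which is what makes the approach training-free with respect to the target LLM.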
🔎 Similar Papers
2024-10-05 · Conference on Empirical Methods in Natural Language Processing · Citations: 0
Yong Zhang
Ping An Technology (Shenzhen) Co., Ltd., China
Yanwen Huang
PhD Candidate, Department of Pharmaceutical Sciences, Peking University
Ning Cheng
TeraHop
Yang Guo
Ping An Technology (Shenzhen) Co., Ltd., China
Yun Zhu
Ping An Technology (Shenzhen) Co., Ltd., China
Yanmeng Wang
Ping An Technology (Shenzhen) Co., Ltd., China
Shaojun Wang
Soochow University, TU/e, University of Strasbourg
Nanophotonics · Light-matter interactions · Nanofabrication
Jing Xiao
Ping An Technology (Shenzhen) Co., Ltd., China