SAGE: Selective Attention-Guided Extraction for Token-Efficient

📅 2026-04-16

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the high computational cost of processing full-length documents and the limited effectiveness of conventional retrieval-augmented approaches in long-document, multi-query question answering. The authors propose SAGE, a training-free, plug-and-play context reduction framework in which a lightweight local LLM performs a single prefilling pass to produce a query-specific attention heatmap. Differential attention strategies then isolate question-relevant evidence, and the top-scoring units are kept within a user-specified token budget to form the reduced context passed to a downstream LLM. The approach outperforms traditional reduction techniques across multiple long-document QA benchmarks and ranks in the top 4 on QuALITY-hard using only 10% of the original context (a 90% token reduction) with competitive accuracy.

📝 Abstract

Large language models with long context windows can answer complex questions directly from full-length academic, technical, and policy documents, but passing entire documents is often costly and slow, and it can degrade answer quality while increasing the risk of unnecessary data leakage. This paper targets the common setting of answering many heterogeneous questions over one or more long documents, where fixed position heuristics and standard retrieval-augmented generation (RAG) can fail due to document structure variability and weak query-chunk semantic similarity, which often requires task- and domain-specific tuning of embedding retrievers. We propose Selective Attention-Guided Extraction (SAGE), a training-free, plug-and-play context reduction framework that uses a lightweight local LLM to perform a single prefilling pass and convert language model attention signals into a query-specific relevance heatmap at configurable granularities. SAGE further introduces differential attention strategies to better isolate question-relevant evidence, then selects the top-scoring units under a user-defined token budget and forwards only this reduced context to a downstream LLM for answer generation. SAGE surpasses traditional reduction techniques across multiple long-document QA benchmarks, notably securing a top-4 rank on QuALITY-hard while constrained to a 10% context budget. This enables a 90% reduction in tokens with competitive accuracy, without the need for model fine-tuning or complex calibration.
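The selection step described in the abstract can be illustrated with a minimal sketch. The abstract does not define how differential attention is computed, so the version below is an assumption: each unit's score is its query-conditioned attention mass minus a query-independent baseline, clipped at zero. The function names, the toy attention values, and the per-unit token counts are all hypothetical, and the attention scores are taken as given rather than extracted from a model.

```python
from typing import List, Sequence


def differential_scores(query_attn: Sequence[float],
                        baseline_attn: Sequence[float]) -> List[float]:
    """Assumed form of differential attention (not taken from the paper):
    query-conditioned attention minus a baseline, clipped at zero."""
    if len(query_attn) != len(baseline_attn):
        raise ValueError("score vectors must have the same length")
    return [max(q - b, 0.0) for q, b in zip(query_attn, baseline_attn)]


def select_under_budget(scores: Sequence[float],
                        unit_tokens: Sequence[int],
                        budget: int) -> List[int]:
    """Greedily keep the highest-scoring units that still fit the token budget.
    Returns the kept unit indices in document order."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    kept, used = [], 0
    for i in order:
        if used + unit_tokens[i] <= budget:
            kept.append(i)
            used += unit_tokens[i]
    return sorted(kept)


# Toy document split into four units (e.g. sentences); all numbers are made up.
units = [
    "Intro boilerplate.",
    "The trial enrolled 120 patients.",
    "Acknowledgements.",
    "Dosage was 5 mg daily.",
]
unit_tokens = [4, 7, 3, 6]                 # hypothetical token counts per unit
query_attn = [0.30, 0.40, 0.10, 0.20]      # attention mass given the question
baseline_attn = [0.35, 0.10, 0.10, 0.05]   # attention mass without the question

scores = differential_scores(query_attn, baseline_attn)
kept = select_under_budget(scores, unit_tokens, budget=13)
reduced_context = " ".join(units[i] for i in kept)
print(kept, reduced_context)
```

In this toy case the first unit receives high raw attention but an even higher baseline, so the subtraction removes it, and the two evidence-bearing units fill the 13-token budget exactly. Restoring document order before joining is a design choice made here so that the reduced context reads coherently for the downstream model.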

Problem

Research questions and friction points this paper is trying to address.

long-document question answering

context reduction

retrieval-augmented generation

token efficiency

attention mechanisms

Innovation

Methods, ideas, or system contributions that make the work stand out.

Selective Attention-Guided Extraction

differential attention

token-efficient LLM