Attn-GS: Attention-Guided Context Compression for Efficient Personalized LLMs

📅 2026-02-08
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenge that personalized large language models struggle to leverage lengthy user histories efficiently: input length constraints force truncation, and long contexts drive up inference latency and API costs. To overcome this, the authors propose an attention-based context compression framework that exploits an LLM's internal token-level attention patterns to identify task-relevant personalized information. These attention signals guide a compression model to generate high-fidelity, condensed context representations. Unlike conventional heuristic approaches that compress the whole context uniformly, the method enables fine-grained, task-aware information selection. Experimental results demonstrate that the approach achieves performance nearly on par with using the full context while retaining only 1/50 of the original tokens, substantially outperforming existing baselines across multiple tasks.

📝 Abstract
Personalizing large language models (LLMs) to individual users requires incorporating extensive interaction histories and profiles, but input token constraints make this impractical due to high inference latency and API costs. Existing approaches rely on heuristic methods such as selecting recent interactions or prompting summarization models to compress user profiles. However, these methods treat context as a monolithic whole and fail to consider how LLMs internally process and prioritize different profile components. We investigate whether LLMs' attention patterns can effectively identify important personalization signals for intelligent context compression. Through preliminary studies on representative personalization tasks, we discover that (a) LLMs' attention patterns naturally reveal important signals, and (b) fine-tuning enhances LLMs' ability to distinguish between relevant and irrelevant information. Based on these insights, we propose Attn-GS, an attention-guided context compression framework that leverages attention feedback from a marking model to mark important personalization sentences, then guides a compression model to generate task-relevant, high-quality compressed user contexts. Extensive experiments demonstrate that Attn-GS significantly outperforms various baselines across different tasks, token limits, and settings, achieving performance close to using full context while reducing token usage by a factor of 50.
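The core idea of attention-guided selection can be illustrated with a minimal sketch: aggregate the attention mass that task tokens place on each context sentence, then greedily keep the highest-scoring sentences within a token budget. This is a hypothetical illustration only; the function name, the mean-pooled scoring rule, and the greedy budget step are assumptions for clarity, not the paper's exact marking or compression procedure.

```python
import numpy as np

def select_sentences(attention, sentence_spans, token_budget):
    """Rank context sentences by the attention mass task tokens place
    on them, then greedily keep top-ranked sentences within a budget.

    attention:      (num_task_tokens, num_context_tokens) weights
    sentence_spans: list of (start, end) token-index pairs per sentence
    token_budget:   max number of context tokens to keep
    """
    # Score each sentence: total attention from task tokens, averaged
    # over the sentence's own tokens so long sentences aren't favored.
    scores = [attention[:, s:e].sum(axis=0).mean() for s, e in sentence_spans]
    # Greedily keep the highest-scoring sentences that fit the budget.
    kept, used = [], 0
    for idx in np.argsort(scores)[::-1]:
        s, e = sentence_spans[idx]
        if used + (e - s) <= token_budget:
            kept.append(int(idx))
            used += e - s
    return sorted(kept)

# Toy example: 2 task tokens attending over 6 context tokens,
# grouped into three sentences of 2 tokens each.
attn = np.array([
    [0.05, 0.05, 0.30, 0.30, 0.15, 0.15],
    [0.10, 0.10, 0.25, 0.25, 0.15, 0.15],
])
spans = [(0, 2), (2, 4), (4, 6)]
print(select_sentences(attn, spans, token_budget=4))  # → [1, 2]
```

With a budget of 4 tokens, the two sentences receiving the most task-token attention (indices 1 and 2) are kept, while the low-attention first sentence is dropped, which is the kind of fine-grained, task-aware selection the paper contrasts with whole-context summarization.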
Problem

Research questions and friction points this paper is trying to address.

personalized LLMs
context compression
input token constraints
inference latency
user profiles
Innovation

Methods, ideas, or system contributions that make the work stand out.

attention-guided compression
personalized LLMs
context compression
token efficiency
attention analysis