Clustering-driven Memory Compression for On-device Large Language Models

📅 2026-01-24
📈 Citations: 0
✨ Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the challenge of on-device large language models, where naively concatenating user memories quickly exhausts limited context windows, while simple averaging for memory compression suffers from semantic conflicts that degrade personalized generation. To overcome this, the paper introduces what it presents as the first clustering-based memory compression mechanism tailored to on-device settings. The method groups memories by the similarity of their embeddings, fuses memories within each cluster, and integrates the compressed representations into the prompt in a context-aware manner. This substantially reduces the number of memory tokens while mitigating semantic interference. Under a fixed context budget, it consistently outperforms both direct concatenation and average-based compression baselines, striking a favorable balance between computational efficiency and the quality of personalized generation.
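The summary's pipeline (group memories by embedding similarity, then fuse within each cluster) can be sketched as follows. The paper does not specify the clustering algorithm or fusion operator, so this is a minimal illustration assuming greedy cosine-similarity clustering and mean-pooled fusion; the `threshold` parameter and both function names are hypothetical.

```python
import numpy as np

def cluster_memories(embeddings, threshold=0.8):
    """Greedy cosine-similarity clustering: assign each memory to the
    first sufficiently similar cluster centroid, else open a new cluster."""
    # Normalize rows so dot products equal cosine similarities.
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    clusters = []   # list of lists of memory indices
    centroids = []  # running (unnormalized) centroid per cluster
    for i, vec in enumerate(normed):
        best, best_sim = None, threshold
        for c, centroid in enumerate(centroids):
            sim = float(vec @ (centroid / np.linalg.norm(centroid)))
            if sim >= best_sim:
                best, best_sim = c, sim
        if best is None:
            clusters.append([i])
            centroids.append(vec.copy())
        else:
            clusters[best].append(i)
            centroids[best] += vec
    return clusters

def compress_memories(embeddings, threshold=0.8):
    """Fuse each cluster into one mean-pooled memory representation,
    so the prompt carries one vector per cluster instead of one per memory."""
    clusters = cluster_memories(embeddings, threshold)
    fused = np.stack([embeddings[idx].mean(axis=0) for idx in clusters])
    return fused, clusters
```

With two semantically distinct groups of memories (e.g. embeddings near `[1, 0]` and near `[0, 1]`), this collapses four memory vectors into two fused representations, illustrating how conflicting memories are kept in separate clusters rather than averaged together.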

๐Ÿ“ Abstract
Large language models (LLMs) often rely on user-specific memories distilled from past interactions to enable personalized generation. A common practice is to concatenate these memories with the input prompt, but this approach quickly exhausts the limited context available in on-device LLMs. Compressing memories by averaging can mitigate context growth, yet it frequently harms performance due to semantic conflicts across heterogeneous memories. In this work, we introduce a clustering-based memory compression strategy that balances context efficiency and personalization quality. Our method groups memories by similarity and merges them within clusters prior to concatenation, thereby preserving coherence while reducing redundancy. Experiments demonstrate that our approach substantially lowers the number of memory tokens while outperforming baseline strategies such as naive averaging or direct concatenation. Furthermore, for a fixed context budget, clustering-driven merging yields more compact memory representations and consistently enhances generation quality.
Problem

Research questions and friction points this paper is trying to address.

memory compression
on-device LLMs
context limitation
semantic conflict
personalized generation
Innovation

Methods, ideas, or system contributions that make the work stand out.

clustering
memory compression
on-device LLMs
personalization
context efficiency