Clustering-driven Memory Compression for On-device Large Language Models

📅 2026-01-24
📈 Citations: 0
✨ Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the challenge of on-device large language models, where naively concatenating user memories quickly exhausts limited context windows, while simple averaging for memory compression suffers from semantic conflicts that degrade personalized generation. To overcome this, the paper introduces what it presents as the first clustering-based memory compression mechanism tailored to on-device settings. The method groups memories by the similarity of their embeddings, fuses memories within each cluster, and integrates the compressed representations into the prompt in a context-aware manner. This substantially reduces the number of memory tokens while mitigating semantic interference. Under a fixed context budget, it consistently outperforms both direct concatenation and average-based compression baselines, striking a favorable balance between computational efficiency and the quality of personalized generation.
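The summary's pipeline (group memories by embedding similarity, then fuse within each cluster) can be sketched as follows. The paper does not specify the clustering algorithm or fusion operator, so this is a minimal illustration assuming greedy cosine-similarity clustering and mean-pooled fusion; the `threshold` parameter and both function names are hypothetical.

```python
import numpy as np

def cluster_memories(embeddings, threshold=0.8):
    """Greedy cosine-similarity clustering: assign each memory to the
    first sufficiently similar cluster centroid, else open a new cluster."""
    # Normalize rows so dot products equal cosine similarities.
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    clusters = []   # list of lists of memory indices
    centroids = []  # running (unnormalized) centroid per cluster
    for i, vec in enumerate(normed):
        best, best_sim = None, threshold
        for c, centroid in enumerate(centroids):
            sim = float(vec @ (centroid / np.linalg.norm(centroid)))
            if sim >= best_sim:
                best, best_sim = c, sim
        if best is None:
            clusters.append([i])
            centroids.append(vec.copy())
        else:
            clusters[best].append(i)
            centroids[best] += vec
    return clusters

def compress_memories(embeddings, threshold=0.8):
    """Fuse each cluster into one mean-pooled memory representation,
    so the prompt carries one vector per cluster instead of one per memory."""
    clusters = cluster_memories(embeddings, threshold)
    fused = np.stack([embeddings[idx].mean(axis=0) for idx in clusters])
    return fused, clusters
```

With two semantically distinct groups of memories (e.g. embeddings near `[1, 0]` and near `[0, 1]`), this collapses four memory vectors into two fused representations, illustrating how conflicting memories are kept in separate clusters rather than averaged together.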

๐Ÿ“ Abstract
Large language models (LLMs) often rely on user-specific memories distilled from past interactions to enable personalized generation. A common practice is to concatenate these memories with the input prompt, but this approach quickly exhausts the limited context available in on-device LLMs. Compressing memories by averaging can mitigate context growth, yet it frequently harms performance due to semantic conflicts across heterogeneous memories. In this work, we introduce a clustering-based memory compression strategy that balances context efficiency and personalization quality. Our method groups memories by similarity and merges them within clusters prior to concatenation, thereby preserving coherence while reducing redundancy. Experiments demonstrate that our approach substantially lowers the number of memory tokens while outperforming baseline strategies such as naive averaging or direct concatenation. Furthermore, for a fixed context budget, clustering-driven merging yields more compact memory representations and consistently enhances generation quality.
Problem

Research questions and friction points this paper is trying to address.

memory compression
on-device LLMs
context limitation
semantic conflict
personalized generation
Innovation

Methods, ideas, or system contributions that make the work stand out.

clustering
memory compression
on-device LLMs
personalization
context efficiency