RAC: Relation-Aware Cache Replacement for Large Language Models

📅 2026-02-24
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
This work addresses the challenge of balancing cost and latency in large language model (LLM) serving under limited cache capacity, where conventional caching strategies relying on recency and frequency signals exhibit unstable performance under real-world workloads. The paper proposes RAC, a semantic-aware online cache eviction policy that introduces semantic relatedness into LLM caching decisions for the first time. RAC employs an online learning framework that integrates topic modeling and graph-structured analysis to dynamically extract two novel signals: “topic popularity,” capturing long-term reuse potential at the thematic level, and “structural importance,” reflecting future reuse value within local dependency contexts. Experimental results demonstrate that RAC improves cache hit rates by 20%–30% over the strongest baselines across diverse real-world workloads, while exhibiting strong generalization and stability.

📝 Abstract
The scaling of Large Language Model (LLM) services faces significant cost and latency challenges, making effective caching under tight capacity crucial. Existing cache replacement policies, from heuristics to learning-based methods, predominantly rely on limited-window statistics such as recency and frequency. We show these signals are not robust for real-world LLM workloads, which exhibit long reuse distances and sparse local recurrence. To address these limitations, we propose Relation-Aware Cache (RAC), an online eviction strategy that leverages semantic relations among requests to guide eviction decisions. RAC synthesizes two relation-aware signals: (1) Topical Prevalence, which aggregates access evidence at the topic level to capture long-horizon reuse; and (2) Structural Importance, which leverages local intra-topic dependency structure to discriminate entries by their future reuse value. Extensive evaluations show that RAC maintains high effectiveness across diverse workloads, consistently surpassing state-of-the-art baselines by 20%–30% in cache hit ratio.
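
To make the two signals concrete, here is a minimal toy sketch of an eviction policy that combines a topic-level score with an intra-topic structural score. It is not the authors' algorithm: the abstract does not give RAC's formulas, so everything below is an assumption made for illustration. In particular, the class name `RelationAwareCacheSketch`, the exponential `decay`, the rule that links each request to the previous request of the same topic, and the product-form score are all hypothetical, and topic labels are assumed to be supplied by the caller rather than learned online.

```python
from collections import defaultdict


class RelationAwareCacheSketch:
    """Toy eviction policy (illustrative only, not the paper's method).

    score(entry) = topic_score[topic] * (1 + number of cached same-topic neighbours)
    The entry with the lowest score is evicted. Assumes capacity >= 1.
    """

    def __init__(self, capacity, decay=0.9):
        self.capacity = capacity
        self.decay = decay                      # hypothetical forgetting factor
        self.entries = {}                       # key -> topic label
        self.topic_score = defaultdict(float)   # stand-in for "topical prevalence"
        self.links = defaultdict(set)           # stand-in for intra-topic structure
        self.last_in_topic = {}                 # topic -> most recent key

    def _score(self, key):
        topic = self.entries[key]
        # Count only neighbours that are still cached.
        degree = sum(1 for k in self.links.get(key, ()) if k in self.entries)
        return self.topic_score[topic] * (1 + degree)

    def access(self, key, topic):
        """Record one request; return True on a cache hit, False on a miss."""
        # Topic-level evidence: decay every topic, then credit the accessed one.
        for t in list(self.topic_score):
            self.topic_score[t] *= self.decay
        self.topic_score[topic] += 1.0

        # Structural evidence: link this key to the previous key of the same topic.
        prev = self.last_in_topic.get(topic)
        if prev is not None and prev != key:
            self.links[prev].add(key)
            self.links[key].add(prev)
        self.last_in_topic[topic] = key

        if key in self.entries:
            return True
        if len(self.entries) >= self.capacity:
            victim = min(self.entries, key=self._score)
            del self.entries[victim]
        self.entries[key] = topic
        return False


cache = RelationAwareCacheSketch(capacity=2)
trace = [("q1", "sql"), ("q2", "poem"), ("q3", "sql"), ("q1", "sql")]
hits = [cache.access(key, topic) for key, topic in trace]
print(hits)  # [False, False, False, True]
```

In this trace, a plain LRU cache of size 2 would evict `q1` when `q3` arrives, so the final request for `q1` would miss. The sketch instead evicts `q2`, because the `sql` topic has accumulated more access evidence than `poem`, which is the kind of long-horizon, topic-level reuse the abstract argues that recency alone cannot capture.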
Problem

Research questions and friction points this paper is trying to address.

cache replacement
large language models
reuse distance
workload sparsity
semantic relations
Innovation

Methods, ideas, or system contributions that make the work stand out.

Relation-Aware Cache
Large Language Models
Cache Replacement
Semantic Relations
Topical Prevalence
Authors

Yuchong Wu
Zhejiang University
Zihuan Xu
Shenzhen Institute of Computing Sciences
Wangze Ni
Zhejiang University
Peng Cheng
Tongji University
Lei Chen
Hong Kong University of Science and Technology
Human Powered Machine Learning, Databases, Data Mining
Xuemin Lin
Shanghai Jiao Tong University
Heng Tao Shen
Tongji University
Kui Ren
Professor and Dean of Computer Science, Zhejiang University, ACM/IEEE Fellow
Data Security & Privacy, AI Security, IoT & Vehicular Security