Rethinking Caching for LLM Serving Systems: Beyond Traditional Heuristics

📅 2025-08-26
📈 Citations: 0
Influential: 0
🤖 AI Summary
Traditional LLM caching—e.g., exact-match or prefix caching—ignores semantic similarity, while existing semantic caches lack architectural innovation. To address this, we propose SISO, a novel semantic caching system. Methodologically, SISO introduces three key innovations: (i) centroid-based semantic clustering to eliminate token-level matching dependencies; (ii) locality-aware dynamic cache replacement; and (iii) adaptive similarity-threshold control. By integrating semantic embeddings, online clustering, and request locality modeling, SISO achieves up to 1.71× higher cache hit rates across multiple real-world LLM request datasets, along with significantly improved SLO compliance and memory utilization. As a scalable, semantics-driven caching paradigm, SISO advances the state of the art for high-concurrency, low-latency LLM serving.
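The centroid-based idea above can be sketched as follows: cached responses are grouped into clusters, and a lookup compares the query embedding only against cluster centroids rather than every cached entry. This is a minimal illustrative sketch, not SISO's actual implementation; the class name, threshold value, and running-mean centroid update are all assumptions.

```python
import numpy as np

def _cosine(a, b):
    # Cosine similarity between two embedding vectors.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

class CentroidCache:
    """Hypothetical centroid-based semantic cache sketch."""

    def __init__(self, threshold=0.85):
        self.threshold = threshold   # cosine-similarity cutoff for a hit
        self.centroids = []          # one embedding centroid per cluster
        self.responses = []          # representative response per cluster
        self.counts = []             # number of queries merged into each cluster

    def lookup(self, query_emb):
        # Match against centroids only, so lookup cost scales with the
        # number of clusters instead of the number of cached queries.
        best_i, best_sim = None, -1.0
        for i, c in enumerate(self.centroids):
            sim = _cosine(query_emb, c)
            if sim > best_sim:
                best_i, best_sim = i, sim
        if best_i is not None and best_sim >= self.threshold:
            return self.responses[best_i]   # semantic cache hit
        return None                          # miss: fall through to the LLM

    def insert(self, query_emb, response):
        # Merge into the nearest cluster if it is close enough, updating
        # the centroid as a running mean; otherwise start a new cluster.
        query_emb = np.asarray(query_emb, dtype=float)
        for i, c in enumerate(self.centroids):
            if _cosine(query_emb, c) >= self.threshold:
                n = self.counts[i]
                self.centroids[i] = (c * n + query_emb) / (n + 1)
                self.counts[i] = n + 1
                return
        self.centroids.append(query_emb)
        self.responses.append(response)
        self.counts.append(1)
```

Because hits only require similarity to a centroid, one cluster can cover many paraphrased queries, which is how this design trades a small amount of memory for broad coverage.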

📝 Abstract
Serving Large Language Models (LLMs) at scale requires meeting strict Service Level Objectives (SLOs) under severe computational and memory constraints. Nevertheless, traditional caching strategies fall short: exact-matching and prefix caches neglect query semantics, while state-of-the-art semantic caches remain confined to traditional intuitions, offering little conceptual departure. Building on this, we present SISO, a semantic caching system that redefines efficiency for LLM serving. SISO introduces centroid-based caching to maximize coverage with minimal memory, locality-aware replacement to preserve high-value entries, and dynamic thresholding to balance accuracy and latency under varying workloads. Across diverse datasets, SISO delivers up to 1.71× higher hit ratios and consistently stronger SLO attainment compared to state-of-the-art systems.
Problem

Research questions and friction points this paper is trying to address.

Optimizing LLM serving efficiency under computational constraints
Overcoming limitations of traditional semantic caching strategies
Balancing accuracy and latency in dynamic workload environments
Innovation

Methods, ideas, or system contributions that make the work stand out.

Centroid-based caching maximizes coverage with minimal memory
Locality-aware replacement preserves high-value cache entries
Dynamic thresholding balances accuracy and latency under varying workloads
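The dynamic-thresholding idea in the last bullet can be sketched as a simple feedback controller: tighten the similarity cutoff when too many semantically wrong reuses (false hits) are observed, and relax it when the hit ratio falls. This is an illustrative sketch only; the function name, targets, step size, and bounds are assumptions, not SISO's actual control policy.

```python
def adapt_threshold(threshold, false_hit_rate, hit_ratio,
                    target_false=0.02, target_hits=0.30,
                    step=0.01, lo=0.70, hi=0.99):
    """Nudge the similarity threshold based on recent cache feedback.

    false_hit_rate: fraction of recent hits judged semantically wrong.
    hit_ratio:      fraction of recent lookups that hit the cache.
    """
    if false_hit_rate > target_false:
        threshold += step   # stricter matching: trade hits for accuracy
    elif hit_ratio < target_hits:
        threshold -= step   # looser matching: trade accuracy for hits
    # Keep the threshold inside a sane operating range.
    return min(hi, max(lo, threshold))
```

Accuracy repair takes priority over hit-ratio recovery here, since serving a wrong cached answer is usually costlier than an extra LLM call; this ordering is a design assumption of the sketch.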