🤖 AI Summary
Traditional LLM caching (e.g., exact-match or prefix caching) ignores semantic similarity, while existing semantic caches offer little architectural innovation. To address this, we propose SISO, a novel semantic caching system. Methodologically, SISO introduces three key innovations: (i) centroid-based semantic clustering to eliminate token-level matching dependencies; (ii) locality-aware dynamic cache replacement; and (iii) adaptive similarity-threshold control. By integrating semantic embeddings, online clustering, and request-locality modeling, SISO achieves up to 1.71× higher cache hit rates across multiple real-world LLM request datasets, with significantly improved SLO compliance and memory efficiency. As a scalable, semantics-driven caching paradigm, SISO advances the state of the art for high-concurrency, low-latency LLM serving.
📝 Abstract
Serving Large Language Models (LLMs) at scale requires meeting strict Service Level Objectives (SLOs) under severe computational and memory constraints. Nevertheless, traditional caching strategies fall short: exact-matching and prefix caches neglect query semantics, while state-of-the-art semantic caches remain confined to traditional intuitions, offering little conceptual departure. To address these shortcomings, we present SISO, a semantic caching system that redefines efficiency for LLM serving. SISO introduces centroid-based caching to maximize coverage with minimal memory, locality-aware replacement to preserve high-value entries, and dynamic thresholding to balance accuracy and latency under varying workloads. Across diverse datasets, SISO delivers up to 1.71× higher hit ratios and consistently stronger SLO attainment compared to state-of-the-art systems.
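To make the three mechanisms concrete, the following is a minimal, illustrative sketch of a centroid-based semantic cache, not SISO's actual implementation: lookups match a query embedding against cluster centroids under a similarity threshold, hits fold the query into the centroid via an online running mean, and eviction prefers entries with low hit counts and stale recency. The class name, scoring rule, and fixed threshold are all assumptions for illustration (the paper's system adapts its threshold dynamically).

```python
import math
import time


class CentroidSemanticCache:
    """Illustrative centroid-based semantic cache (a sketch, not SISO itself).

    Each entry holds a centroid embedding, a cached response, and locality
    statistics (hit count, last-used time) consulted by the eviction policy.
    """

    def __init__(self, capacity=4, threshold=0.9):
        self.capacity = capacity
        self.threshold = threshold  # static here; an adaptive policy would tune this
        self.entries = []           # dicts: centroid, response, count, last_used

    @staticmethod
    def _cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(x * x for x in b))
        return dot / (na * nb) if na and nb else 0.0

    def get(self, emb):
        """Return a cached response if the closest centroid is similar enough."""
        best, best_sim = None, -1.0
        for e in self.entries:
            sim = self._cosine(emb, e["centroid"])
            if sim > best_sim:
                best, best_sim = e, sim
        if best is not None and best_sim >= self.threshold:
            # Hit: update the centroid with an online running mean of member
            # embeddings, and refresh the locality statistics.
            n = best["count"]
            best["centroid"] = [(c * n + x) / (n + 1)
                                for c, x in zip(best["centroid"], emb)]
            best["count"] = n + 1
            best["last_used"] = time.monotonic()
            return best["response"]
        return None  # semantic miss

    def put(self, emb, response):
        """Insert a new centroid, evicting the least 'local' entry when full."""
        if len(self.entries) >= self.capacity:
            # Locality-aware replacement sketch: evict the entry with the
            # fewest hits, breaking ties by least-recent use.
            victim = min(self.entries, key=lambda e: (e["count"], e["last_used"]))
            self.entries.remove(victim)
        self.entries.append({"centroid": list(emb), "response": response,
                             "count": 1, "last_used": time.monotonic()})
```

In this sketch a paraphrased query whose embedding lies close to an existing centroid returns the cached response without re-running the model, which is the source of the hit-ratio gains the abstract describes; raising the threshold trades hit rate for answer fidelity.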