🤖 AI Summary
Existing prefix caching systems tightly couple cache management with request scheduling, leading to imbalanced cross-instance workloads, data redundancy, and memory fragmentation. This paper introduces TokenLake, the first unified segment-level prefix caching pool, which decouples scheduling from cache management via a declarative caching interface, enabling fine-grained, elastic long-context serving. Its core innovations include: (1) segment-level cache partitioning; (2) a heavy-hitter-aware load balancing algorithm; and (3) transparent minimization of the communication volume of query tensors and newly produced caches. Together, these achieve cache deduplication, defragmentation, and low-latency memory pooling. Under real-world workloads, TokenLake improves throughput by up to 2.6× and cache hit rate by up to 2.1× over state-of-the-art approaches, significantly enhancing caching efficiency and computational resource utilization.
📝 Abstract
Prefix caching is crucial for accelerating multi-turn interactions and requests that share prefixes. At the cluster level, existing prefix caching systems are tightly coupled with request scheduling to optimize cache efficiency and computation performance together, which leads to load imbalance, data redundancy, and memory fragmentation across instances. To address these issues, memory pooling is a promising approach: it shields the scheduler from the underlying cache management so that the scheduler can focus on computation optimization. However, because existing prefix caching systems only transfer increasingly long prefix caches between instances, they cannot achieve low-latency memory pooling.
To address these problems, we propose TokenLake, a unified segment-level prefix cache pool. It uses a declarative cache interface to expose requests' query tensors, prefix caches, and cache-aware operations to TokenLake for efficient pooling. Powered by this abstraction, TokenLake manages prefix caches at the segment level with a heavy-hitter-aware load balancing algorithm to achieve better cache load balance, deduplication, and defragmentation. TokenLake also transparently minimizes the communication volume of query tensors and new caches. On top of TokenLake, the scheduler can schedule requests elastically using existing techniques, without considering prefix cache management. Evaluations on real-world workloads show that TokenLake improves throughput by up to 2.6× and 2.0× and boosts hit rate by 2.0× and 2.1×, compared to state-of-the-art cache-aware routing and cache-centric PD-disaggregation solutions, respectively.
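As a rough illustration of why segment-level management enables deduplication (this is a toy sketch, not TokenLake's implementation), the snippet below identifies each fixed-size token segment by a hash chained over its full prefix, so two requests with a shared prefix map to the same pooled cache entries and only divergent segments are computed anew. All names here (`SegmentPool`, `lookup_or_fill`, the 4-token segment size) are hypothetical.

```python
import hashlib

SEGMENT_SIZE = 4  # tokens per segment; illustrative, real systems use larger blocks


class SegmentPool:
    """Toy segment-level prefix cache pool: identical prefix segments
    across requests resolve to one shared entry (deduplication)."""

    def __init__(self):
        self.segments = {}  # segment id -> cached entry (stand-in for KV tensors)

    def _segment_ids(self, tokens):
        # Split the prefix into full segments; chain the previous hash so a
        # segment's id depends on its entire preceding prefix, not just its
        # own tokens.
        prev = b""
        usable = len(tokens) - len(tokens) % SEGMENT_SIZE
        for i in range(0, usable, SEGMENT_SIZE):
            seg = tokens[i:i + SEGMENT_SIZE]
            prev = hashlib.sha256(prev + repr(seg).encode()).digest()
            yield prev

    def lookup_or_fill(self, tokens):
        """Return how many segments of this request were served from the pool."""
        hits = 0
        for sid in self._segment_ids(tokens):
            if sid in self.segments:
                hits += 1
            else:
                self.segments[sid] = object()  # placeholder for a computed KV segment
        return hits
```

For example, after one request with an 8-token prefix populates two segments, a second request sharing that prefix but appending 4 new tokens hits both shared segments and only fills the third; a prefix-level (whole-request) cache would have treated the two requests as distinct.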