MTServe: Efficient Serving for Generative Recommendation Models with Hierarchical Caches

📅 2026-04-23

🤖 AI Summary
This work addresses the high inference overhead of generative recommendation models caused by repeatedly encoding long user histories and the GPU memory explosion risk from cross-request reuse of KV caches. To overcome these challenges, the authors propose a hierarchical caching architecture based on GPU memory virtualization, which leverages host memory as a scalable backing store. The design integrates a hybrid storage layout, asynchronous data transfer, and a locality-aware cache replacement policy to effectively circumvent GPU memory limitations. Experimental results on both public and production datasets demonstrate that the proposed approach achieves up to 3.1× speedup while maintaining a cache hit rate above 98.5%.

📝 Abstract
Generative recommendation (GR) offers superior modeling capabilities but suffers from prohibitive inference costs due to the repeated encoding of long user histories. While cross-request Key-Value (KV) cache reuse presents a significant optimization opportunity, the massive scale of individual user states creates a storage explosion that far exceeds physical GPU limits. We propose MTServe, a hierarchical cache management system that virtualizes GPU memory by leveraging host RAM as a scalable backup store. To bridge the I/O gap between tiers, MTServe introduces a suite of system-level optimizations, including a hybrid storage layout, an asynchronous data transfer pipeline, and a locality-driven replacement policy. On both public and production datasets, MTServe delivers up to 3.1× speedup while maintaining near-perfect hit ratios (>98.5%).
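
The tiering idea in the abstract can be illustrated with a toy two-level cache. This is a minimal sketch, not MTServe's implementation: the class name `TwoTierKVCache`, its methods, and the capacities are hypothetical, plain Python objects stand in for GPU and host buffers, and simple LRU is swapped in for the paper's locality-driven replacement policy, whose details the abstract does not give. The hybrid storage layout and asynchronous transfer pipeline are not modeled.

```python
from collections import OrderedDict


class TwoTierKVCache:
    """Toy two-tier cache: a small 'GPU' tier backed by a larger 'host' tier.

    Hypothetical illustration only. LRU is an assumed stand-in for the
    locality-driven policy mentioned in the abstract.
    """

    def __init__(self, gpu_capacity, host_capacity):
        self.gpu_capacity = gpu_capacity
        self.host_capacity = host_capacity
        # OrderedDicts keep entries in recency order (most recent last).
        self.gpu = OrderedDict()
        self.host = OrderedDict()
        self.hits = 0
        self.misses = 0

    def _put_host(self, user_id, kv):
        self.host[user_id] = kv
        self.host.move_to_end(user_id)
        # Entries evicted from the host tier are dropped entirely.
        while len(self.host) > self.host_capacity:
            self.host.popitem(last=False)

    def _put_gpu(self, user_id, kv):
        self.gpu[user_id] = kv
        self.gpu.move_to_end(user_id)
        # Entries evicted from the GPU tier are demoted to host memory.
        while len(self.gpu) > self.gpu_capacity:
            victim_id, victim_kv = self.gpu.popitem(last=False)
            self._put_host(victim_id, victim_kv)

    def get(self, user_id):
        """Return the cached KV state, or None if it must be recomputed."""
        if user_id in self.gpu:
            self.hits += 1
            self.gpu.move_to_end(user_id)
            return self.gpu[user_id]
        if user_id in self.host:
            # Host hit: promote back to the GPU tier before use.
            self.hits += 1
            kv = self.host.pop(user_id)
            self._put_gpu(user_id, kv)
            return kv
        self.misses += 1
        return None

    def put(self, user_id, kv):
        """Insert a freshly computed KV state into the GPU tier."""
        self.host.pop(user_id, None)  # avoid a stale copy in the host tier
        self._put_gpu(user_id, kv)
```

In this sketch, a request whose user state was demoted to host memory still counts as a hit and avoids re-encoding the history; only a miss in both tiers forces recomputation. A real system would additionally need to overlap the host-to-GPU copy with computation, which is the role the abstract assigns to the asynchronous transfer pipeline.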
Problem

Research questions and friction points this paper is trying to address.

Generative Recommendation
KV Cache
Memory Explosion
GPU Memory Limitation
Inference Cost
Innovation

Methods, ideas, or system contributions that make the work stand out.

hierarchical caching
KV cache reuse
memory virtualization
generative recommendation
asynchronous data transfer