🤖 AI Summary
This work addresses the high inference overhead of generative recommendation models caused by repeatedly encoding long user histories and the GPU memory explosion risk from cross-request reuse of KV caches. To overcome these challenges, the authors propose a hierarchical caching architecture based on GPU memory virtualization, which leverages host memory as a scalable backing store. The design integrates a hybrid storage layout, asynchronous data transfer, and a locality-aware cache replacement policy to effectively circumvent GPU memory limitations. Experimental results on both public and production datasets demonstrate that the proposed approach achieves up to 3.1× speedup while maintaining a cache hit rate above 98.5%.
📝 Abstract
Generative recommendation (GR) offers superior modeling capabilities but suffers from prohibitive inference costs due to the repeated encoding of long user histories. While cross-request Key-Value (KV) cache reuse presents a significant optimization opportunity, the massive scale of individual user states creates a storage explosion that far exceeds physical GPU limits. We propose MTServe, a hierarchical cache management system that virtualizes GPU memory by leveraging host RAM as a scalable backup store. To bridge the I/O gap between tiers, MTServe introduces a suite of system-level optimizations, including a hybrid storage layout, an asynchronous data transfer pipeline, and a locality-driven replacement policy. On both public and production datasets, MTServe delivers up to 3.1× speedup while maintaining near-perfect hit ratios (>98.5%).
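To make the tiering idea concrete, here is a minimal sketch of a two-tier cache in which a small "GPU" tier is backed by a larger "host" tier. It is not MTServe's implementation: the class name `TwoTierKVCache`, the `encode` callback, and the slot counts are hypothetical, plain LRU stands in for the locality-driven replacement policy (whose details the abstract does not give), and the hybrid storage layout and asynchronous transfer pipeline are not modeled.

```python
from collections import OrderedDict


class TwoTierKVCache:
    """Toy two-tier cache: a small 'GPU' tier backed by a larger 'host' tier.

    Hypothetical illustration only. LRU is an assumed stand-in for the
    paper's locality-driven policy, and tiers are plain dictionaries.
    """

    def __init__(self, gpu_slots, host_slots):
        self.gpu_slots = gpu_slots
        self.host_slots = host_slots
        self.gpu = OrderedDict()   # user_id -> KV state, most recently used last
        self.host = OrderedDict()  # backing store for states evicted from 'GPU'
        self.hits = 0
        self.misses = 0

    def _put_gpu(self, user_id, kv):
        """Place a state in the GPU tier, demoting LRU victims to the host tier."""
        self.gpu[user_id] = kv
        self.gpu.move_to_end(user_id)
        while len(self.gpu) > self.gpu_slots:
            victim, victim_kv = self.gpu.popitem(last=False)
            self.host[victim] = victim_kv
            self.host.move_to_end(victim)
            # The host tier is also bounded; drop its oldest entries entirely.
            while len(self.host) > self.host_slots:
                self.host.popitem(last=False)

    def get(self, user_id, encode):
        """Return the cached state for user_id, calling encode() only on a miss."""
        if user_id in self.gpu:
            self.hits += 1
            self.gpu.move_to_end(user_id)
            return self.gpu[user_id]
        if user_id in self.host:
            # Hit in the backing store: promote the state back to the GPU tier.
            self.hits += 1
            kv = self.host.pop(user_id)
        else:
            # Miss in both tiers: pay the full cost of re-encoding the history.
            self.misses += 1
            kv = encode(user_id)
        self._put_gpu(user_id, kv)
        return kv


if __name__ == "__main__":
    cache = TwoTierKVCache(gpu_slots=2, host_slots=4)
    for uid in ["u1", "u2", "u3", "u1", "u2"]:
        cache.get(uid, encode=lambda u: f"kv({u})")
    print(f"hits={cache.hits} misses={cache.misses}")
```

In this toy model a host-tier hit still avoids re-encoding, which is the saving that cross-request reuse targets; a real system would additionally have to hide the host-to-GPU copy latency, which the sketch ignores.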