🤖 AI Summary
This work addresses the high inference latency of large language models (LLMs) in generative recommendation, which stems mainly from long, personalized prompts and limits industrial deployment. The authors propose RcLLM, a distributed inference system that goes beyond conventional prefix caching by decomposing prompts into reusable blocks that need not be contiguous. It combines a stratified distributed key-value (KV) cache (replicated user-history caches, and item caches sharded with similarity-aware placement), an affinity-based global scheduler, and a selective attention mechanism that corrects approximation errors. Experiments on real-world datasets show that RcLLM reduces time-to-first-token (TTFT) by 1.31× to 9.51× relative to state-of-the-art prefix caching systems, with negligible impact on recommendation accuracy.
📝 Abstract
Large Language Models (LLMs) are transforming recommendation from ranking into a generative task, but industrial deployment remains limited by the high latency of processing long, personalized prompts. Standard prefix caching provides limited benefit because reuse in recommendation workloads is often non-contiguous across user histories and item contexts. We present RcLLM, a distributed inference system for generative recommendation with Beyond-Prefix KV Caching. RcLLM decomposes prompts into reusable blocks and supports large item catalogs with a stratified distributed storage design: compact user-history caches are replicated for zero-latency retrieval, while massive item caches are sharded using similarity-aware placement. To reduce redundant quadratic attention computation, RcLLM combines an affinity-based global scheduler that improves data locality with a selective attention mechanism that corrects approximation errors. Experiments on real-world datasets show that RcLLM reduces Time-To-First-Token (TTFT) by 1.31× to 9.51× compared with state-of-the-art prefix caching systems, enabling real-time serving with negligible impact on recommendation accuracy.
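To make the abstract's point about non-contiguous reuse concrete, the following is a minimal sketch, not RcLLM's implementation. It only counts how many prompt blocks could be served from cache under prefix matching versus block-level matching. The function names (`prefix_hits`, `block_hits`) and the block labels are hypothetical, and the sketch does not model KV tensors, distributed placement, or the selective attention step that the abstract says corrects approximation errors when blocks are reused out of their original context.

```python
from typing import List, Set


def prefix_hits(prompt: List[str], cached_prompts: List[List[str]]) -> int:
    """Blocks reusable under prefix caching: the longest leading run
    that the new prompt shares with any previously cached prompt."""
    best = 0
    for cached in cached_prompts:
        shared = 0
        for new_block, old_block in zip(prompt, cached):
            if new_block != old_block:
                break  # prefix reuse stops at the first mismatch
            shared += 1
        best = max(best, shared)
    return best


def block_hits(prompt: List[str], cached_blocks: Set[str]) -> int:
    """Blocks reusable when each block is cached independently of its
    position (an assumption of this sketch)."""
    return sum(1 for block in prompt if block in cached_blocks)


# Hypothetical prompts: a shared system block, a user-history block, two items.
cached_prompt = ["sys", "user42_hist", "item_A", "item_B"]
new_prompt = ["sys", "user7_hist", "item_B", "item_A"]

prefix_reuse = prefix_hits(new_prompt, [cached_prompt])
block_reuse = block_hits(new_prompt, set(cached_prompt))

print(f"prefix caching reuses {prefix_reuse} of {len(new_prompt)} blocks")
print(f"block-level caching reuses {block_reuse} of {len(new_prompt)} blocks")
```

In this toy case the differing user history breaks the shared prefix after the first block, so prefix matching reuses one block, while block-level matching also recovers the two item blocks despite their changed order.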