🤖 AI Summary
Generative recommender systems (GRS) face severe latency bottlenecks in long-sequence modeling: large beam widths incur substantial decoding overhead; exhaustive ranking over the full item vocabulary is computationally expensive; and redundant KV caching coupled with insufficient pipeline parallelism hinders high-concurrency serving. To address these challenges, the authors propose xGR, a phased unified computation framework featuring three key innovations: (1) a decoupled KV cache management scheme that minimizes memory redundancy; (2) an early-stopping ranking algorithm that truncates costly full-vocabulary scoring; and (3) a mask-driven item filtering mechanism that prunes irrelevant candidates before scoring. Additionally, they design a multi-level overlapping and multi-stream parallel pipeline to maximize hardware utilization. Under stringent low-latency constraints, the system achieves at least 3.49× the throughput of state-of-the-art baselines on real-world recommendation datasets, significantly enhancing the scalability and real-time responsiveness of GRS deployment.
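The early-stopping ranking and mask-driven filtering ideas can be illustrated with a small sketch. This is a hypothetical reconstruction, not the paper's actual algorithm: it assumes candidates can be scanned in chunks ordered by a score upper bound (here, an exact sort for simplicity), so the scan can terminate once no remaining item can enter the top-k.

```python
import heapq

def topk_with_mask_and_early_stop(scores, mask, k, chunk_size=4):
    """Hypothetical sketch: filter candidates by a validity mask, then scan
    score chunks while keeping a running top-k min-heap; stop once the heap
    is full and the best remaining score cannot beat the current k-th score."""
    # Mask out invalid items (e.g., already-consumed or out-of-scope items).
    candidates = [(s, i) for i, (s, m) in enumerate(zip(scores, mask)) if m]
    # Assume chunks arrive in descending order of an upper bound; a real
    # system would use a cheap per-chunk bound instead of a full sort.
    candidates.sort(key=lambda t: -t[0])
    heap = []  # min-heap holding the current top-k (score, item_id) pairs
    for pos in range(0, len(candidates), chunk_size):
        chunk = candidates[pos:pos + chunk_size]
        # Early termination: heap is full and nothing left can enter it.
        if len(heap) == k and chunk[0][0] <= heap[0][0]:
            break
        for s, i in chunk:
            if len(heap) < k:
                heapq.heappush(heap, (s, i))
            elif s > heap[0][0]:
                heapq.heapreplace(heap, (s, i))
    return sorted(heap, reverse=True)
```

The payoff is that full-vocabulary sorting is avoided: once the k-th best score dominates every remaining chunk's bound, the rest of the item space is never scored or sorted.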
📝 Abstract
Recommendation systems deliver substantial economic benefits by providing personalized predictions. Generative recommendation (GR) integrates large language models (LLMs) to enhance the understanding of long user-item sequences. Despite employing attention-based architectures, GR's workload differs markedly from that of LLM serving. GR typically processes long prompts while producing short, fixed-length outputs, yet the computational cost of each decode phase is especially high due to the large beam width. In addition, because beam search operates over a vast item space, the sorting overhead becomes particularly time-consuming. We propose xGR, a GR-oriented serving system that meets strict low-latency requirements under high-concurrency scenarios. First, xGR unifies the processing of prefill and decode phases through staged computation and a separated KV cache. Second, xGR enables early sorting termination and mask-based item filtering with data structure reuse. Third, xGR reconstructs the overall pipeline to exploit multi-level overlap and multi-stream parallelism. Our experiments on real-world recommendation service datasets demonstrate that xGR achieves at least 3.49x the throughput of the state-of-the-art baseline under strict latency constraints.
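The separated-KV-cache idea can be sketched in a few lines. This is a minimal illustration under assumed semantics (the class and method names are invented, not from the paper): the long prompt's KV entries are stored once and shared read-only across all beams, while each beam keeps only its own short decode-phase suffix, so beam expansion never duplicates the prompt cache.

```python
class DecoupledKVCache:
    """Hypothetical sketch of a prefill/decode-separated KV cache: one
    shared prompt cache plus small per-beam decode caches."""

    def __init__(self, prompt_kv):
        self.prompt_kv = prompt_kv  # written once at prefill, shared by all beams
        self.beam_kv = {}           # beam_id -> list of decode-step KV entries

    def append(self, beam_id, kv):
        # Each decode step adds one entry only to the beam-local suffix.
        self.beam_kv.setdefault(beam_id, []).append(kv)

    def full_sequence(self, beam_id):
        # Attention reads the shared prefix plus the beam-local suffix;
        # no per-beam copy of the prompt KV is ever materialized.
        return self.prompt_kv + self.beam_kv.get(beam_id, [])

    def fork(self, src_beam, dst_beam):
        # Beam-search expansion copies only the short decode suffix,
        # which is where the memory savings over a fused cache come from.
        self.beam_kv[dst_beam] = list(self.beam_kv.get(src_beam, []))
```

With a long prompt and a large beam width, the memory saved is roughly (beam_width − 1) × prompt_length KV entries per request, which is what makes high-concurrency serving feasible.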