🤖 AI Summary
Generative recommender systems (GRS) face severe latency bottlenecks in long-sequence modeling: large beam widths incur substantial decoding overhead; exhaustive ranking over the full item vocabulary is computationally expensive; and redundant KV caching coupled with insufficient pipeline parallelism hinders high-concurrency serving. To address these challenges, the authors propose xGR, a phased unified computation framework featuring three key innovations: (1) a decoupled KV cache management scheme that minimizes memory redundancy; (2) an early-stopping ranking algorithm that truncates costly full-vocabulary scoring; and (3) a mask-driven item filtering mechanism that prunes irrelevant candidates before scoring. Additionally, they design a multi-level overlapping and multi-stream parallel pipeline to maximize hardware utilization. Under stringent low-latency constraints, the system achieves at least 3.49× the throughput of state-of-the-art baselines on real-world recommendation datasets, significantly enhancing the scalability and real-time responsiveness of GRS deployment.
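The early-stopping ranking and mask-driven filtering ideas can be illustrated with a small sketch. This is a hypothetical reconstruction, not the paper's actual algorithm: it assumes candidates can be scanned in chunks ordered by a score upper bound (here, an exact sort for simplicity), so the scan can terminate once no remaining item can enter the top-k.

```python
import heapq

def topk_with_mask_and_early_stop(scores, mask, k, chunk_size=4):
    """Hypothetical sketch: filter candidates by a validity mask, then scan
    score chunks while keeping a running top-k min-heap; stop once the heap
    is full and the best remaining score cannot beat the current k-th score."""
    # Mask out invalid items (e.g., already-consumed or out-of-scope items).
    candidates = [(s, i) for i, (s, m) in enumerate(zip(scores, mask)) if m]
    # Assume chunks arrive in descending order of an upper bound; a real
    # system would use a cheap per-chunk bound instead of a full sort.
    candidates.sort(key=lambda t: -t[0])
    heap = []  # min-heap holding the current top-k (score, item_id) pairs
    for pos in range(0, len(candidates), chunk_size):
        chunk = candidates[pos:pos + chunk_size]
        # Early termination: heap is full and nothing left can enter it.
        if len(heap) == k and chunk[0][0] <= heap[0][0]:
            break
        for s, i in chunk:
            if len(heap) < k:
                heapq.heappush(heap, (s, i))
            elif s > heap[0][0]:
                heapq.heapreplace(heap, (s, i))
    return sorted(heap, reverse=True)
```

The payoff is that full-vocabulary sorting is avoided: once the k-th best score dominates every remaining chunk's bound, the rest of the item space is never scored or sorted.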
📝 Abstract
Recommendation systems deliver substantial economic benefits by providing personalized predictions. Generative recommendation (GR) integrates large language models (LLMs) to enhance the understanding of long user-item sequences. Despite employing attention-based architectures, GR's workload differs markedly from that of LLM serving. GR typically processes long prompts while producing short, fixed-length outputs, yet the computational cost of each decode phase is especially high due to the large beam width. In addition, because beam search operates over a vast item space, the sorting overhead becomes particularly time-consuming. We propose xGR, a GR-oriented serving system that meets strict low-latency requirements under high-concurrency scenarios. First, xGR unifies the processing of prefill and decode phases through staged computation and a separated KV cache. Second, xGR enables early sorting termination and mask-based item filtering with data structure reuse. Third, xGR reconstructs the overall pipeline to exploit multi-level overlap and multi-stream parallelism. Our experiments on real-world recommendation service datasets demonstrate that xGR achieves at least 3.49x the throughput of the state-of-the-art baseline under strict latency constraints.
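The separated-KV-cache idea can be sketched in a few lines. This is a minimal illustration under assumed semantics (the class and method names are invented, not from the paper): the long prompt's KV entries are stored once and shared read-only across all beams, while each beam keeps only its own short decode-phase suffix, so beam expansion never duplicates the prompt cache.

```python
class DecoupledKVCache:
    """Hypothetical sketch of a prefill/decode-separated KV cache: one
    shared prompt cache plus small per-beam decode caches."""

    def __init__(self, prompt_kv):
        self.prompt_kv = prompt_kv  # written once at prefill, shared by all beams
        self.beam_kv = {}           # beam_id -> list of decode-step KV entries

    def append(self, beam_id, kv):
        # Each decode step adds one entry only to the beam-local suffix.
        self.beam_kv.setdefault(beam_id, []).append(kv)

    def full_sequence(self, beam_id):
        # Attention reads the shared prefix plus the beam-local suffix;
        # no per-beam copy of the prompt KV is ever materialized.
        return self.prompt_kv + self.beam_kv.get(beam_id, [])

    def fork(self, src_beam, dst_beam):
        # Beam-search expansion copies only the short decode suffix,
        # which is where the memory savings over a fused cache come from.
        self.beam_kv[dst_beam] = list(self.beam_kv.get(src_beam, []))
```

With a long prompt and a large beam width, the memory saved is roughly (beam_width − 1) × prompt_length KV entries per request, which is what makes high-concurrency serving feasible.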