xGR: Efficient Generative Recommendation Serving at Scale

📅 2025-12-12
📈 Citations: 0
Influential: 0
🤖 AI Summary
Generative recommender systems (GRS) face severe latency bottlenecks in long-sequence modeling: large beam widths incur substantial decoding overhead; exhaustive ranking over the full item vocabulary is computationally expensive; and redundant KV caching coupled with insufficient pipeline parallelism hinders high-concurrency serving. To address these challenges, we propose a phased unified computation framework featuring three key innovations: (1) a decoupled KV cache management scheme that minimizes memory redundancy; (2) an early-stopping ranking algorithm that truncates costly full-vocabulary scoring; and (3) a mask-driven item filtering mechanism that prunes irrelevant candidates before scoring. Additionally, we design a multi-level overlapping and multi-stream parallel pipeline to maximize hardware utilization. Under stringent low-latency constraints, our system achieves over 3.49× higher throughput than state-of-the-art baselines on real-world recommendation datasets, significantly enhancing scalability and real-time responsiveness of GRS deployment.
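The paper does not spell out its exact algorithms here, but the early-stopping ranking and mask-driven filtering ideas can be illustrated with a minimal sketch: instead of fully sorting the item vocabulary, keep a size-k min-heap and skip masked-out candidates before they are ever considered (function and variable names below are illustrative, not from the paper).

```python
import heapq

def topk_with_mask(scores, mask, k):
    """Select the top-k item scores without fully sorting the vocabulary.

    scores: one score per item in the vocabulary
    mask:   True if the item is a valid candidate, False if pruned
    k:      number of items to return

    A size-k min-heap costs O(n log k) instead of the O(n log n) full
    sort; masked-out items are skipped before they reach the heap.
    """
    heap = []  # min-heap of (score, item_id), size <= k
    for item_id, (score, valid) in enumerate(zip(scores, mask)):
        if not valid:
            continue  # mask-driven filtering: prune before ranking
        if len(heap) < k:
            heapq.heappush(heap, (score, item_id))
        elif score > heap[0][0]:
            # New candidate beats the current k-th best; replace it
            heapq.heapreplace(heap, (score, item_id))
    # Return (score, item_id) pairs in descending score order
    return sorted(heap, key=lambda t: -t[0])
```

With a vocabulary of millions of items and k in the hundreds (typical beam widths), avoiding the full sort is where the "early-stopping" savings come from.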


📝 Abstract
Recommendation systems deliver substantial economic benefits by providing personalized predictions. Generative recommendation (GR) integrates LLMs to enhance the understanding of long user-item sequences. Despite employing attention-based architectures, GR's workload differs markedly from that of LLM serving. GR typically processes long prompts while producing short, fixed-length outputs, yet the computational cost of each decode phase is especially high due to the large beam width. In addition, since beam search involves a vast item space, the sorting overhead becomes particularly time-consuming. We propose xGR, a GR-oriented serving system that meets strict low-latency requirements under high-concurrency scenarios. First, xGR unifies the processing of the prefill and decode phases through staged computation and a separated KV cache. Second, xGR enables early sorting termination and mask-based item filtering with data structure reuse. Third, xGR reconstructs the overall pipeline to exploit multi-level overlap and multi-stream parallelism. Our experiments with real-world recommendation service datasets demonstrate that xGR achieves at least 3.49× higher throughput than the state-of-the-art baseline under strict latency constraints.
Problem

Research questions and friction points this paper is trying to address.

Optimizes generative recommendation serving for low latency
Reduces computational cost of beam search decoding
Minimizes sorting overhead in large item spaces
Innovation

Methods, ideas, or system contributions that make the work stand out.

Unifies prefill and decode phases via staged computation
Enables early termination and filtering with data structure reuse
Reconstructs pipeline for multilevel overlap and parallelism
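The separated KV cache mentioned above can be sketched as follows. The paper's actual data layout is not given on this page; this is a minimal illustration (class and method names are hypothetical) of the core idea: the long prompt's KV entries are computed once at prefill and shared read-only across all beams, while each beam appends only its own short decode-phase entries, avoiding per-beam duplication of the prompt cache.

```python
class DecoupledKVCache:
    """Illustrative prefill/decode-separated KV cache for beam search.

    The prompt KV is stored once and shared by every beam; each beam
    owns only its private decode-step entries, so memory grows with
    beam_width * decode_len rather than beam_width * (prompt + decode).
    """

    def __init__(self, prompt_kv):
        self.prompt_kv = prompt_kv  # written once at prefill, then read-only
        self.decode_kv = {}         # beam_id -> list of per-step KV entries

    def append(self, beam_id, kv_step):
        # Each decode step adds one entry to the owning beam only
        self.decode_kv.setdefault(beam_id, []).append(kv_step)

    def full_sequence(self, beam_id):
        # Logical view seen by attention: shared prompt + private suffix
        return self.prompt_kv + self.decode_kv.get(beam_id, [])
```

Because GR prompts are long and outputs are short and fixed-length, the shared prompt portion dominates, which is why decoupling it from the per-beam decode cache removes most of the redundancy.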
Qingxiao Sun
China University of Petroleum, Beijing
GPU Architecture · HPC · Deep Learning
Tongxuan Liu
University of Science and Technology of China
LLM Logic Reasoning · Multi-Agents · LLM Inference System · LVLM · Recommender System
Shen Zhang
MEGVII
Deep Learning · Computer Vision
Siyu Wu
Beihang University
Peijun Yang
JD Company
Haotian Liang
University of Science and Technology Beijing
Menxin Li
JD Company
Xiaolong Ma
Assistant Professor, The University of Arizona
Deep Learning · Computer Vision · Efficient Learning System · Trustworthy AI
Zhiwei Liang
JD Company
Ziyi Ren
JD Company
Minchao Zhang
JD Company
Xinyu Liu
Huawei
Ke Zhang
JD Company
Depei Qian
Beihang University
Hailong Yang
Beihang University