RelayGR: Scaling Long-Sequence Generative Recommendation via Cross-Stage Relay-Race Inference

📅 2026-01-05 · 🏛️ arXiv.org · 📈 Citations: 0 (Influential: 0)
🤖 AI Summary
This work addresses the challenge of efficiently serving long user-behavior sequences in generative recommender systems under the strict tail-latency constraints of production environments. To remove redundant computation from the critical path, the authors propose a cross-stage pipelined inference mechanism that precomputes the key-value (KV) cache of the candidate-independent user-behavior prefix at an early pipeline stage, keeps it resident in high-bandwidth memory (HBM), and reuses it during the ranking stage. They design an industrial-scale cache-reuse architecture with three core components: a sequence-aware trigger, an affinity-aware router, and a memory-aware expander. The system is further optimized for Huawei Ascend NPUs through tailored cache-management and request-scheduling strategies. Experimental results show that, under a fixed P99 latency budget, the approach supports sequences up to 1.5× longer and achieves up to 3.6× higher SLO-compliant throughput.
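To make the relay-race idea concrete, here is a minimal hypothetical sketch (all names, shapes, and the toy attention are illustrative, not from the paper) of how a candidate-independent prefix KV cache can be produced at an early stage and consumed at ranking time, so that only the short candidate segment remains on the latency-critical path:

```python
# Sketch of cross-stage KV-cache reuse. `kv_cache` stands in for HBM-resident
# per-user prefix caches; `pre_infer` runs off the critical path, `rank` on it.
import numpy as np

D = 64  # hidden size (assumed)

def attend(q, k, v):
    """Single-head scaled dot-product attention with a stable softmax."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

kv_cache = {}  # user_id -> (K_prefix, V_prefix); proxy for HBM residency

def pre_infer(user_id, prefix_tokens):
    # In the real system this is a full model forward pass on the NPU; here
    # we stand in for the K/V projections with the token embeddings themselves.
    kv_cache[user_id] = (prefix_tokens.copy(), prefix_tokens.copy())

def rank(user_id, candidate_tokens):
    k_pre, v_pre = kv_cache[user_id]              # cache hit: no prefix recompute
    k = np.concatenate([k_pre, candidate_tokens])
    v = np.concatenate([v_pre, candidate_tokens])
    return attend(candidate_tokens, k, v)         # only candidates on critical path

user_prefix = np.random.randn(2048, D)  # long behavior sequence, known early
candidates  = np.random.randn(32, D)    # item candidates arrive at ranking time
pre_infer("u42", user_prefix)           # early stage, off the critical path
scores = rank("u42", candidates)        # ranking latency scales with |candidates|
```

The point of the sketch is the asymmetry: the 2048-token prefix is processed once ahead of time, while the ranking stage only pays for the 32 candidate tokens.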

📝 Abstract
Real-time recommender systems execute multi-stage cascades (retrieval, pre-ranking, fine-grained ranking) under strict tail-latency SLOs, leaving only tens of milliseconds for ranking. Generative recommendation (GR) models can improve quality by consuming long user-behavior sequences, but in production their online sequence length is tightly capped by the ranking-stage P99 budget. We observe that the majority of GR tokens encode user behaviors that are independent of the item candidates, suggesting an opportunity to pre-infer a user-behavior prefix once and reuse it during ranking rather than recomputing it on the critical path. Realizing this idea at industrial scale is non-trivial: the prefix cache must survive across multiple pipeline stages before the final ranking instance is determined, the user population implies cache footprints far beyond a single device, and indiscriminate pre-inference would overload shared resources under high QPS. We present RelayGR, a production system that enables in-HBM relay-race inference for GR. RelayGR selectively pre-infers long-term user prefixes, keeps their KV caches resident in HBM over the request lifecycle, and ensures the subsequent ranking can consume them without remote fetches. RelayGR combines three techniques: 1) a sequence-aware trigger that admits only at-risk requests under a bounded cache footprint and pre-inference load, 2) an affinity-aware router that co-locates cache production and consumption by routing both the auxiliary pre-infer signal and the ranking request to the same instance, and 3) a memory-aware expander that uses server-local DRAM to capture short-term cross-request reuse while avoiding redundant reloads. We implement RelayGR on Huawei Ascend NPUs and evaluate it with real queries. Under a fixed P99 SLO, RelayGR supports up to 1.5× longer sequences and improves SLO-compliant throughput by up to 3.6×.
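The abstract's first two techniques are control-plane decisions that can be sketched independently of the model. Below is a minimal, assumed illustration (instance names, thresholds, and per-token byte costs are invented, and the DRAM-tier memory-aware expander is omitted for brevity) of a sequence-aware trigger gating pre-inference under an HBM budget, and an affinity-aware router that later sends the ranking request to the instance already holding the cache:

```python
# Hypothetical trigger + affinity-routing logic, not the paper's implementation.
from dataclasses import dataclass, field

@dataclass
class Instance:
    name: str
    hbm_used: int = 0
    resident_users: set = field(default_factory=set)

SEQ_THRESHOLD = 1024       # assumed: shorter sequences meet the SLO without help
HBM_BUDGET = 8 * 2**30     # assumed per-instance cache budget (bytes)
BYTES_PER_TOKEN = 4096     # assumed KV footprint per token (bytes)

instances = [Instance("npu-0"), Instance("npu-1")]
routing = {}  # user_id -> instance chosen at pre-infer time

def should_pre_infer(seq_len, inst):
    """Sequence-aware trigger: admit only at-risk requests within budget."""
    cost = seq_len * BYTES_PER_TOKEN
    return seq_len >= SEQ_THRESHOLD and inst.hbm_used + cost <= HBM_BUDGET

def route_pre_infer(user_id, seq_len):
    """Place the prefix cache and remember where it lives."""
    inst = min(instances, key=lambda i: i.hbm_used)  # least-loaded placement
    if should_pre_infer(seq_len, inst):
        inst.hbm_used += seq_len * BYTES_PER_TOKEN
        inst.resident_users.add(user_id)
        routing[user_id] = inst  # recorded for later affinity routing
    return routing.get(user_id)

def route_ranking(user_id):
    """Affinity-aware routing: follow the cached prefix if one exists."""
    inst = routing.get(user_id)
    return inst if inst else min(instances, key=lambda i: i.hbm_used)
```

The key property in this sketch is that placement is decided once, at pre-infer time, so a ranking request whose prefix was admitted never pays a remote fetch for its KV cache.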
Problem

Research questions and friction points this paper is trying to address.

generative recommendation
long-sequence modeling
tail-latency SLO
real-time recommender systems
sequence length limitation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Generative Recommendation
Relay-Race Inference
KV Cache Reuse
Tail-Latency Optimization
HBM Caching
👥 Authors

Jiarui Wang · Huawei Technologies Co., Ltd.
Hui Chai · Huawei Technologies Co., Ltd.
Yuanhang Zhang · Institute of Computing Technology, Chinese Academy of Sciences · Computer Vision, Multi-Modal Learning, Visual Speech Recognition, Video Understanding
Zongjin Zhou · Huawei Technologies Co., Ltd.
Wei Guo · Huawei Technologies Co., Ltd.
Xingkun Yang · Huawei Technologies Co., Ltd.
Qiang Tang · Huawei Technologies Co., Ltd.
Bo Pan · Huawei Technologies Co., Ltd.
Jiawei Zhu · Huawei Technologies Co., Ltd.
Ke Cheng · Xidian University · Secure Multi-Party Computation
Yuting Yan · Nanjing University · Edge Intelligence, AI System, Video Analytics System
Shulan Wang · Huawei Technologies Co., Ltd.
Yingjie Zhu · Harbin Institute of Technology, Shenzhen · Natural Language Processing, Vision-Language Models, Large Language Models, Fact Checking
Zhengfan Yuan · Huawei Technologies Co., Ltd.
Jiaqi Huang · University of Central Missouri · Cybersecurity, IoV
Yuhan Zhang · Huawei Technologies Co., Ltd.
Xiaosong Sun · Huawei Technologies Co., Ltd.
Zhinan Zhang · Professor, Shanghai Jiao Tong University · Engineering design/education, knowledge management, tribology
Hong Zhu · Huawei Technologies Co., Ltd.
Yongsheng Zhang · Huawei Technologies Co., Ltd.
Tian Dong · Shanghai Jiao Tong University · Computer Security, Machine Learning
Zhong Xiao · Huawei Technologies Co., Ltd.
Deliang Liu · Huawei Technologies Co., Ltd.
Chengzhou Lu · Huawei Technologies Co., Ltd.
Yuanqiang Sun · Huawei Technologies Co., Ltd.
Zhiyuan Chen · Huawei Technologies Co., Ltd.
Xinming Han · Huawei Technologies Co., Ltd.
Zaizhu Liu · Huawei Technologies Co., Ltd.
Yaoyuan Wang · Huawei Technologies Co., Ltd.
Ziyang Zhang · Huawei Technologies Co., Ltd.
Yong Liu · Huawei, NTU, I2R · Recommender Systems, Data Mining, Machine Learning
Jinxin Xu · Huawei Technologies Co., Ltd.
Yajing Sun · Huawei Technologies Co., Ltd.
Zhoujun Yu · Huawei Technologies Co., Ltd.
Wenting Zhou · Huawei Technologies Co., Ltd.
Qidong Zhang · Huawei Technologies Co., Ltd.
Zhengyong Zhang · Huawei Technologies Co., Ltd.
Zhonghai Gu · Huawei Technologies Co., Ltd.
Yibo Jin · State Key Laboratory for Novel Software Technology, Nanjing University, Nanjing, China · Distributed System, Machine Learning
Yong Feng · Swinburne University of Technology, Australia · Sliding Mode Control, Electrical Engineering, Control and Observers
Pengfei Zuo · Huawei · AI Infrastructure, Cloud Infrastructure, Machine Learning Systems, Memory Systems, Storage Systems