🤖 AI Summary
This work addresses the challenge of efficiently processing long user-behavior sequences in generative recommender systems under strict tail-latency constraints in production. To remove redundant computation from the critical path, the authors propose a cross-stage pipelined inference mechanism that pre-computes the key-value (KV) caches of user-behavior prefixes during an earlier pipeline stage and keeps them resident in high-bandwidth memory (HBM) for reuse in the ranking stage. They design an industrial-scale cache-reuse architecture with three core components: a sequence-aware trigger, affinity-aware routing, and a memory-aware expander. The system is further optimized for Huawei Ascend NPUs through tailored cache management and request scheduling. Under a fixed P99 latency budget, the proposed approach supports sequences up to 1.5× longer and achieves up to 3.6× higher SLO-compliant throughput.
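The relay idea above can be sketched in a few lines. This is an illustrative toy, not RelayGR's actual API: the class and method names are invented, and a dict stands in for per-layer (K, V) tensors in HBM. The point is the division of labor: the user-behavior prefix is inferred once at an early stage, and the ranking stage only computes the candidate tokens, attending against the cached prefix.

```python
# Hypothetical sketch of cross-stage prefix KV reuse (names are illustrative,
# not RelayGR's real interface). A dict plays the role of HBM-resident KV caches.
from dataclasses import dataclass, field


@dataclass
class HBMPrefixCache:
    """Maps a request id to the cached KV entries of its user-behavior prefix."""
    entries: dict = field(default_factory=dict)

    def pre_infer(self, request_id: str, behavior_tokens: list) -> None:
        # Stand-in for a forward pass that materializes (K, V) for each
        # behavior token and keeps them resident in HBM.
        self.entries[request_id] = [("kv", t) for t in behavior_tokens]

    def rank(self, request_id: str, candidate_tokens: list) -> int:
        # Ranking consumes the cached prefix instead of recomputing it;
        # only the candidate tokens need a fresh forward pass.
        prefix_kv = self.entries.pop(request_id)  # evict after consumption
        return len(prefix_kv) + len(candidate_tokens)  # total tokens attended


cache = HBMPrefixCache()
cache.pre_infer("req-1", ["click", "view", "purchase"])  # early pipeline stage
print(cache.rank("req-1", ["cand-a", "cand-b"]))         # ranking stage → 5
```

In the real system the cache must survive across pipeline stages and land on the same instance that serves ranking, which is what the routing and admission machinery described in the abstract below handles.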
📝 Abstract
Real-time recommender systems execute multi-stage cascades (retrieval, pre-processing, fine-grained ranking) under strict tail-latency SLOs, leaving only tens of milliseconds for ranking. Generative recommendation (GR) models can improve quality by consuming long user-behavior sequences, but in production their online sequence length is tightly capped by the ranking-stage P99 budget. We observe that the majority of GR tokens encode user behaviors that are independent of the item candidates, suggesting an opportunity to pre-infer a user-behavior prefix once and reuse it during ranking rather than recomputing it on the critical path. Realizing this idea at industrial scale is non-trivial: the prefix cache must survive across multiple pipeline stages before the final ranking instance is determined, the user population implies cache footprints far beyond a single device, and indiscriminate pre-inference would overload shared resources under high QPS. We present RelayGR, a production system that enables in-HBM relay-race inference for GR. RelayGR selectively pre-infers long-term user prefixes, keeps their KV caches resident in HBM over the request lifecycle, and ensures the subsequent ranking can consume them without remote fetches. RelayGR combines three techniques: 1) a sequence-aware trigger that admits only at-risk requests under a bounded cache footprint and pre-inference load, 2) an affinity-aware router that co-locates cache production and consumption by routing both the auxiliary pre-infer signal and the ranking request to the same instance, and 3) a memory-aware expander that uses server-local DRAM to capture short-term cross-request reuse while avoiding redundant reloads. We implement RelayGR on Huawei Ascend NPUs and evaluate it with real queries. Under a fixed P99 SLO, RelayGR supports up to 1.5$\times$ longer sequences and improves SLO-compliant throughput by up to 3.6$\times$.
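Two of the three techniques above lend themselves to a compact sketch. The following is a minimal illustration under stated assumptions: the threshold, footprint bound, instance count, and all function names are invented for exposition, and the memory-aware expander (the DRAM tier for short-term cross-request reuse) is omitted for brevity.

```python
# Illustrative sketch of (1) a sequence-aware trigger and (2) affinity-aware
# routing. All names and constants are assumptions, not the paper's values.
import hashlib

NUM_INSTANCES = 4             # assumed size of the ranking fleet
SEQ_LEN_THRESHOLD = 2048      # assumed: shorter sequences fit the latency budget
MAX_CACHED_REQUESTS = 10_000  # assumed bound on the HBM cache footprint

cached: set[str] = set()


def should_pre_infer(request_id: str, seq_len: int) -> bool:
    """Sequence-aware trigger: admit only at-risk (long-sequence) requests,
    and only while the cache footprint stays within its bound."""
    if seq_len < SEQ_LEN_THRESHOLD or len(cached) >= MAX_CACHED_REQUESTS:
        return False
    cached.add(request_id)
    return True


def route(user_id: str) -> int:
    """Affinity-aware routing: hash the user id so the pre-infer signal and
    the later ranking request land on the same instance, letting ranking
    consume the KV cache where it was produced, without remote fetches."""
    digest = hashlib.sha256(user_id.encode()).digest()
    return int.from_bytes(digest[:8], "big") % NUM_INSTANCES


# Pre-infer signal and ranking request agree on the serving instance:
assert route("user-42") == route("user-42")
print(should_pre_infer("req-1", 4096))  # → True  (long sequence, admitted)
print(should_pre_infer("req-2", 512))   # → False (fits the budget, skipped)
```

The design choice the sketch captures is that admission and placement are decided up front: indiscriminate pre-inference would overload shared resources under high QPS, so only requests that would otherwise blow the ranking budget pay the pre-inference cost, and deterministic routing keeps production and consumption co-located.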