🤖 AI Summary
This work addresses the challenge of efficiently processing long user-behavior sequences in generative recommender systems under strict tail-latency constraints in production. To remove redundant computation from the critical path, the authors propose a cross-stage pipelined inference mechanism that pre-computes the key-value (KV) caches of user-behavior prefixes during an earlier pipeline stage and keeps them resident in high-bandwidth memory (HBM) for reuse in the ranking stage. They design an industrial-scale cache-reuse architecture with three core components: a sequence-aware trigger, affinity-aware routing, and a memory-aware expander. The system is further optimized for Huawei Ascend NPUs through tailored cache management and request scheduling. Under a fixed P99 latency budget, the proposed approach supports sequences up to 1.5× longer and achieves up to 3.6× higher SLO-compliant throughput.
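The relay idea above can be sketched in a few lines. This is an illustrative toy, not RelayGR's actual API: the class and method names are invented, and a dict stands in for per-layer (K, V) tensors in HBM. The point is the division of labor: the user-behavior prefix is inferred once at an early stage, and the ranking stage only computes the candidate tokens, attending against the cached prefix.

```python
# Hypothetical sketch of cross-stage prefix KV reuse (names are illustrative,
# not RelayGR's real interface). A dict plays the role of HBM-resident KV caches.
from dataclasses import dataclass, field


@dataclass
class HBMPrefixCache:
    """Maps a request id to the cached KV entries of its user-behavior prefix."""
    entries: dict = field(default_factory=dict)

    def pre_infer(self, request_id: str, behavior_tokens: list) -> None:
        # Stand-in for a forward pass that materializes (K, V) for each
        # behavior token and keeps them resident in HBM.
        self.entries[request_id] = [("kv", t) for t in behavior_tokens]

    def rank(self, request_id: str, candidate_tokens: list) -> int:
        # Ranking consumes the cached prefix instead of recomputing it;
        # only the candidate tokens need a fresh forward pass.
        prefix_kv = self.entries.pop(request_id)  # evict after consumption
        return len(prefix_kv) + len(candidate_tokens)  # total tokens attended


cache = HBMPrefixCache()
cache.pre_infer("req-1", ["click", "view", "purchase"])  # early pipeline stage
print(cache.rank("req-1", ["cand-a", "cand-b"]))         # ranking stage → 5
```

In the real system the cache must survive across pipeline stages and land on the same instance that serves ranking, which is what the routing and admission machinery described in the abstract below handles.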
📝 Abstract
Real-time recommender systems execute multi-stage cascades (retrieval, pre-processing, fine-grained ranking) under strict tail-latency SLOs, leaving only tens of milliseconds for ranking. Generative recommendation (GR) models can improve quality by consuming long user-behavior sequences, but in production their online sequence length is tightly capped by the ranking-stage P99 budget. We observe that the majority of GR tokens encode user behaviors that are independent of the item candidates, suggesting an opportunity to pre-infer a user-behavior prefix once and reuse it during ranking rather than recomputing it on the critical path. Realizing this idea at industrial scale is non-trivial: the prefix cache must survive across multiple pipeline stages before the final ranking instance is determined, the user population implies cache footprints far beyond a single device, and indiscriminate pre-inference would overload shared resources under high QPS. We present RelayGR, a production system that enables in-HBM relay-race inference for GR. RelayGR selectively pre-infers long-term user prefixes, keeps their KV caches resident in HBM over the request lifecycle, and ensures the subsequent ranking can consume them without remote fetches. RelayGR combines three techniques: 1) a sequence-aware trigger that admits only at-risk requests under a bounded cache footprint and pre-inference load, 2) an affinity-aware router that co-locates cache production and consumption by routing both the auxiliary pre-infer signal and the ranking request to the same instance, and 3) a memory-aware expander that uses server-local DRAM to capture short-term cross-request reuse while avoiding redundant reloads. We implement RelayGR on Huawei Ascend NPUs and evaluate it with real queries. Under a fixed P99 SLO, RelayGR supports up to 1.5$\times$ longer sequences and improves SLO-compliant throughput by up to 3.6$\times$.
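Two of the three techniques above lend themselves to a compact sketch. The following is a minimal illustration under stated assumptions: the threshold, footprint bound, instance count, and all function names are invented for exposition, and the memory-aware expander (the DRAM tier for short-term cross-request reuse) is omitted for brevity.

```python
# Illustrative sketch of (1) a sequence-aware trigger and (2) affinity-aware
# routing. All names and constants are assumptions, not the paper's values.
import hashlib

NUM_INSTANCES = 4             # assumed size of the ranking fleet
SEQ_LEN_THRESHOLD = 2048      # assumed: shorter sequences fit the latency budget
MAX_CACHED_REQUESTS = 10_000  # assumed bound on the HBM cache footprint

cached: set[str] = set()


def should_pre_infer(request_id: str, seq_len: int) -> bool:
    """Sequence-aware trigger: admit only at-risk (long-sequence) requests,
    and only while the cache footprint stays within its bound."""
    if seq_len < SEQ_LEN_THRESHOLD or len(cached) >= MAX_CACHED_REQUESTS:
        return False
    cached.add(request_id)
    return True


def route(user_id: str) -> int:
    """Affinity-aware routing: hash the user id so the pre-infer signal and
    the later ranking request land on the same instance, letting ranking
    consume the KV cache where it was produced, without remote fetches."""
    digest = hashlib.sha256(user_id.encode()).digest()
    return int.from_bytes(digest[:8], "big") % NUM_INSTANCES


# Pre-infer signal and ranking request agree on the serving instance:
assert route("user-42") == route("user-42")
print(should_pre_infer("req-1", 4096))  # → True  (long sequence, admitted)
print(should_pre_infer("req-2", 512))   # → False (fits the budget, skipped)
```

The design choice the sketch captures is that admission and placement are decided up front: indiscriminate pre-inference would overload shared resources under high QPS, so only requests that would otherwise blow the ranking budget pay the pre-inference cost, and deterministic routing keeps production and consumption co-located.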