TurboGR: An Accelerated Training System for Large-Scale Generative Recommendation

📅 2026-05-13

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the challenges of training generative recommendation models on Ascend NPUs, where irregular (jagged) sparse operators lack high-performance implementations and fit poorly with a hardware design optimized for dense computation. The authors propose TurboGR, a training system tailored to Ascend and built on three groups of techniques: jagged-aware operator fusion with dynamic load balancing; hierarchical sparse parallelism, semi-asynchronous training, and fine-grained pipeline scheduling; and negative sampling optimization through asynchronous offloading, jaggedness-aware FP16 quantization, and intra-batch logit sharing. Evaluated on the KuaiRand-27K dataset, the system supports training models of up to 0.2 billion parameters, reaches a Model FLOPs Utilization (MFU) of 54.71%, achieves near-linear scaling efficiency of 0.97, and reduces inter-device load imbalance from 47% to 2.4%.
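To make the load-imbalance figures above more concrete, here is a minimal sketch of why variable-length user sequences skew per-device work and how a simple rebalancing heuristic can reduce it. Everything here is an assumption for illustration: the summary does not define its imbalance metric or describe TurboGR's balancing algorithm, so `imbalance` uses one common definition (the busiest device's excess over the mean load) and `greedy_balance` is the generic longest-first heuristic, not the authors' method.

```python
import heapq

def imbalance(loads):
    """Relative excess of the busiest device over the mean load (assumed metric)."""
    mean = sum(loads) / len(loads)
    return (max(loads) - mean) / mean

def round_robin(seq_lens, num_devices):
    """Naive placement: sequence i goes to device i mod num_devices."""
    loads = [0] * num_devices
    for i, n in enumerate(seq_lens):
        loads[i % num_devices] += n
    return loads

def greedy_balance(seq_lens, num_devices):
    """Longest-first heuristic: place each sequence on the least-loaded device."""
    heap = [(0, d) for d in range(num_devices)]  # (current load, device id)
    heapq.heapify(heap)
    loads = [0] * num_devices
    for n in sorted(seq_lens, reverse=True):
        load, d = heapq.heappop(heap)
        loads[d] = load + n
        heapq.heappush(heap, (loads[d], d))
    return loads

# Hypothetical token counts for eight user-history sequences.
seq_lens = [900, 40, 850, 30, 700, 60, 500, 20]
naive = round_robin(seq_lens, 2)
balanced = greedy_balance(seq_lens, 2)
print(naive, round(imbalance(naive), 3))
print(balanced, round(imbalance(balanced), 3))
```

The sequence lengths are made up; the point is only that balancing by token count rather than by sequence count is what matters when lengths are jagged.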

📝 Abstract

Generative recommendation (GR) has emerged as a promising paradigm that replaces fragmented, scenario-specific architectures with unified Transformer-based models, exhibiting scaling-law behavior where recommendation quality improves systematically with increased model capacity and training data. However, deploying GR at scale faces fundamental system-level challenges. These challenges are further exacerbated on Ascend NPUs by the absence of high-performance implementations for jagged operators and by the architectural mismatch between irregular sparse primitives and the NPU's dense-computation-optimized design. In this paper, we present TurboGR, an Ascend-affinity training system for generative recommendation that systematically addresses these bottlenecks through three core innovations: (i) Ascend-affinity jagged acceleration, including fusion operators that eliminate padding redundancy and dynamic load balancing that reduces inter-device imbalance from 47% to 2.4%; (ii) distributed communication optimization, comprising hierarchical sparse parallelism, semi-asynchronous training with proven convergence guarantees, and fine-grained pipeline orchestration that sustains 94% NPU utilization; and (iii) negative sampling optimization via asynchronous offloading, jaggedness-aware FP16 quantization, and intra-batch logit sharing that expands the effective negative space without additional embedding lookups. Evaluated on the KuaiRand-27K dataset, TurboGR supports training at up to 0.2B parameters and achieves 54.71% MFU with near-linear scalability (0.97).
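The "padding redundancy" mentioned in the abstract can be illustrated with a small sketch of a jagged layout, in which variable-length sequences are stored as one flat `values` list plus an `offsets` list, instead of being padded to the longest sequence. This is a generic illustration under assumed names (`to_jagged`, `padding_waste`, `jagged_sum` are hypothetical helpers); it is not TurboGR's fused operators, which the abstract does not detail.

```python
def to_jagged(seqs):
    """Flatten variable-length sequences into (values, offsets).

    Sequence i occupies values[offsets[i]:offsets[i + 1]].
    """
    values, offsets = [], [0]
    for s in seqs:
        values.extend(s)
        offsets.append(len(values))
    return values, offsets

def padding_waste(offsets):
    """Fraction of a padded (batch x max_len) tensor that would be padding."""
    lengths = [b - a for a, b in zip(offsets, offsets[1:])]
    padded_size = max(lengths) * len(lengths)
    return 1 - sum(lengths) / padded_size

def jagged_sum(values, offsets):
    """Per-sequence sum computed directly on the jagged layout, with no padding."""
    return [sum(values[a:b]) for a, b in zip(offsets, offsets[1:])]

# Three hypothetical interaction histories of lengths 6, 1, and 2.
seqs = [[1, 2, 3, 4, 5, 6], [7], [8, 9]]
values, offsets = to_jagged(seqs)
print(offsets)                      # segment boundaries
print(padding_waste(offsets))       # share of work a padded layout would waste
print(jagged_sum(values, offsets))  # reduction over each segment
```

In this toy batch, half of a padded tensor would be padding, which is the kind of wasted computation that jagged-aware operators aim to avoid.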

Problem

Research questions and friction points this paper is trying to address.

Generative Recommendation

Ascend NPU

Jagged Operators

Sparse Computation

System-level Challenges

Innovation

Methods, ideas, or system contributions that make the work stand out.

Generative Recommendation

Ascend NPU

Jagged Operator Optimization