🤖 AI Summary
Industrial training systems for generative recommendation models (GRMs) suffer from inefficient sparse embedding updates, GPU load imbalance, and suboptimal embedding lookup performance. To address these challenges, this paper introduces MTGRBoost, an efficient and scalable system tailored for large-scale GRM training. Its method features: (1) a dynamic hash table replacing static embedding tables to enable real-time embedding insertion/deletion and low-latency lookup; (2) a dynamic sequence balancing strategy coupled with embedding ID deduplication and automatic table merging to mitigate long-tail distribution effects and redundant parameter updates; and (3) integration of mixed-precision training, gradient accumulation, operator fusion, and fault-tolerant checkpointing. Experiments demonstrate 1.6×–2.4× higher training throughput and near-linear scalability up to 100 GPUs. The system has been deployed in production at Meituan, serving hundreds of millions of requests daily.
📝 Abstract
Recommendation is crucial for both user experience and company revenue, and generative recommendation models (GRMs) have recently been shown to produce high-quality recommendations. However, existing systems are limited by insufficient functionality support and inefficient implementations for training GRMs in industrial scenarios. As such, we introduce MTGRBoost, an efficient and scalable system for GRM training. Specifically, to handle the real-time insertion/deletion of sparse embedding entries, MTGRBoost employs dynamic hash tables to replace static tables. To improve efficiency, MTGRBoost conducts dynamic sequence balancing to address the computation load imbalances among GPUs, and adopts embedding ID deduplication alongside automatic table merging to accelerate embedding lookup. MTGRBoost also incorporates implementation optimizations including checkpoint resuming, mixed precision training, gradient accumulation, and operator fusion. Extensive experiments show that MTGRBoost improves training throughput by $1.6\times$–$2.4\times$ while achieving good scalability when running on over 100 GPUs. MTGRBoost has been deployed for many applications in Meituan and is now handling hundreds of millions of requests on a daily basis.
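To make the two core embedding ideas concrete, here is a minimal sketch, in plain Python with NumPy, of a dynamic hash-table embedding store with ID deduplication. The class name, method names, and eviction API are hypothetical illustrations of the general technique, not MTGRBoost's actual implementation (which runs on GPUs with custom kernels):

```python
import numpy as np

class DynamicEmbeddingTable:
    """Illustrative sketch: a hash table (dict) maps sparse feature IDs to
    embedding rows, so entries can be inserted and deleted at runtime
    instead of pre-allocating a static table."""

    def __init__(self, dim, seed=0):
        self.dim = dim
        self.table = {}  # sparse ID -> embedding vector
        self.rng = np.random.default_rng(seed)

    def lookup(self, ids):
        # Deduplicate IDs so each unique embedding is fetched (and would be
        # updated) only once, then scatter rows back to the original order.
        unique_ids, inverse = np.unique(np.asarray(ids), return_inverse=True)
        rows = []
        for i in unique_ids:
            if i not in self.table:  # real-time insertion of unseen IDs
                self.table[i] = self.rng.standard_normal(self.dim).astype(np.float32)
            rows.append(self.table[i])
        return np.stack(rows)[inverse]

    def evict(self, ids):
        # real-time deletion of stale or low-frequency entries
        for i in ids:
            self.table.pop(i, None)

# Usage: duplicate ID 7 resolves to the same stored row.
table = DynamicEmbeddingTable(dim=4)
emb = table.lookup([7, 3, 7, 9])          # shape (4, 4)
assert np.allclose(emb[0], emb[2])
table.evict([3])                          # entry 3 is removed
```

The deduplication step mirrors why it speeds up training: in long user-behavior sequences the same item IDs recur often, so looking up (and applying gradients to) only the unique IDs avoids redundant memory traffic and parameter updates.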