🤖 AI Summary
DLRM deployments suffer from minute-scale model staleness due to embedding table (EMT) synchronization latency across clusters, severely degrading recommendation freshness and, with it, business revenue. To address this, we propose an inference-side dynamic low-rank update framework: (1) integrating a LoRA trainer directly into inference nodes to eliminate cross-cluster synchronization; (2) introducing a singular-value-driven dynamic rank pruning mechanism for fine-grained update compression; and (3) employing NUMA-aware QoS scheduling to isolate update traffic from inference and prevent resource contention. The solution incurs <2% memory overhead, reduces update cost by 2×, improves AUC by 0.04–0.24% within one hour, adds <20 ms to P99 latency, and raises inference CPU utilization by reusing otherwise idle resources. Our core contribution is the first deep integration of lightweight, adaptive training into the inference path, enabling zero-synchronization, low-overhead, high-reliability real-time model evolution.
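To make mechanisms (1) and (2) concrete, below is a minimal NumPy sketch of how a low-rank embedding update and singular-value-driven rank pruning could fit together. It assumes the update takes the standard LoRA form EMT + A·B with the base table frozen; all names (`prune_update`, `lookup`) and the 0.99 energy threshold are illustrative, not the system's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Base embedding table (frozen during online updates): V rows of d dims.
V, d, r_max = 10_000, 64, 16
emt = rng.standard_normal((V, d)).astype(np.float32)

# LoRA factors: the colocated trainer learns only A (V x r_max) and
# B (r_max x d); serving uses emt + A @ B, so the full-size base table
# never has to be re-synchronized across clusters. Random init here is
# only so the sketch has a nontrivial spectrum to prune.
A = (0.01 * rng.standard_normal((V, r_max))).astype(np.float32)
B = (0.01 * rng.standard_normal((r_max, d))).astype(np.float32)

def prune_update(A, B, energy=0.99):
    """Rotate the factors into the SVD basis of the delta A @ B, then keep
    the smallest rank retaining `energy` of the spectral mass. Thin QR keeps
    the cost at O(r^3) instead of an SVD of the full V x d delta."""
    qa, ra = np.linalg.qr(A)               # A = qa @ ra,     qa: V x r
    qb, rb = np.linalg.qr(B.T)             # B = rb.T @ qb.T, qb: d x r
    u, s, vt = np.linalg.svd(ra @ rb.T)    # delta = qa @ u @ diag(s) @ vt @ qb.T
    cum = np.cumsum(s**2) / np.sum(s**2)
    k = int(np.searchsorted(cum, energy) + 1)
    return (qa @ u[:, :k]) * s[:k], vt[:k] @ qb.T   # pruned A: V x k, B: k x d

def lookup(ids, A, B):
    """Serve embeddings with the (pruned) low-rank correction applied."""
    return emt[ids] + A[ids] @ B

A, B = prune_update(A, B)
print("active rank:", A.shape[1])          # <= r_max after pruning
print(lookup(np.array([3, 7, 42]), A, B).shape)
```

Truncating in the SVD basis means the kept factors form the best rank-k approximation of the update, so memory shrinks with minimal loss of the learned correction.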
📝 Abstract
Deep Learning Recommendation Models (DLRMs) underpin personalized services but face a critical freshness-accuracy tradeoff due to massive parameter synchronization overheads. Production DLRMs deploy decoupled training/inference clusters, where synchronizing petabyte-scale embedding tables (EMTs) causes multi-minute staleness, degrading recommendation quality and revenue. We observe that (1) inference nodes exhibit sustained CPU underutilization (peak ≤20%), and (2) EMT gradients possess intrinsic low-rank structure, enabling compact update representation. We present LiveUpdate, a system that eliminates inter-cluster synchronization by colocating Low-Rank Adaptation (LoRA) trainers within inference nodes. LiveUpdate addresses two core challenges: (1) dynamic rank adaptation via singular value monitoring to constrain memory overhead (<2% of EMTs), and (2) NUMA-aware resource scheduling with hardware-enforced QoS to eliminate update/inference contention (P99 latency impact <20 ms). Evaluations show LiveUpdate reduces update costs by 2× versus delta-update baselines while achieving higher accuracy within 1-hour windows. By transforming idle inference resources into freshness engines, LiveUpdate delivers online model updates while outperforming state-of-the-art delta-update methods by 0.04–0.24% in accuracy.
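As a rough illustration of the NUMA-aware isolation idea, the sketch below pins the colocated trainer to one set of cores and serving to another using Linux CPU affinity. The even/odd core split is an assumption standing in for real NUMA topology discovery, and the hardware-enforced QoS layer (e.g., cache/bandwidth partitioning) is only noted in comments; this is not the paper's actual scheduler.

```python
import os
import multiprocessing as mp

# Illustrative topology: assume a two-socket box where NUMA node 0 owns the
# even cores and node 1 the odd cores. A real deployment would read the
# topology from /sys/devices/system/node (or libnuma) instead of hardcoding.
ALL_CPUS = set(range(os.cpu_count() or 1))
INFERENCE_CPUS = {c for c in ALL_CPUS if c % 2 == 0}   # node 0: serving
TRAINER_CPUS = ALL_CPUS - INFERENCE_CPUS or ALL_CPUS   # node 1: LoRA updates

def lora_trainer():
    # Confine the colocated trainer to its own NUMA node so its memory
    # traffic and cache footprint stay off the serving socket. Hardware QoS
    # (e.g., cache/bandwidth partitioning via Linux resctrl) would be layered
    # on top; CPU affinity is just the simplest first line of isolation.
    os.sched_setaffinity(0, TRAINER_CPUS)   # Linux-only call
    ...  # consume fresh samples and update the low-rank factors

if __name__ == "__main__":
    os.sched_setaffinity(0, INFERENCE_CPUS)             # keep serving on node 0
    mp.Process(target=lora_trainer, daemon=True).start()
```

Keeping the trainer's working set local to its own socket is what bounds the P99 latency impact: inference threads never compete with update threads for cores, last-level cache, or memory bandwidth on the serving node.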