🤖 AI Summary
DLRM deployments suffer from minute-scale model staleness due to embedding table (EMT) synchronization latency across clusters, severely degrading recommendation freshness and, with it, business revenue. To address this, we propose an inference-side dynamic low-rank update framework: (1) integrating a LoRA trainer directly into inference nodes to eliminate cross-cluster synchronization; (2) introducing a singular-value-driven dynamic rank pruning mechanism for fine-grained update compression; and (3) employing NUMA-aware QoS scheduling to isolate update traffic from inference and prevent resource contention. The solution incurs <2% memory overhead, reduces update cost by 2×, improves AUC by 0.04–0.24% within one hour, adds <20 ms to P99 latency, and raises inference CPU utilization by reusing otherwise idle resources. Our core contribution is the first deep integration of lightweight, adaptive training into the inference path, enabling zero-synchronization, low-overhead, high-reliability real-time model evolution.
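To make mechanisms (1) and (2) concrete, below is a minimal NumPy sketch of how a low-rank embedding update and singular-value-driven rank pruning could fit together. It assumes the update takes the standard LoRA form EMT + A·B with the base table frozen; all names (`prune_update`, `lookup`) and the 0.99 energy threshold are illustrative, not the system's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Base embedding table (frozen during online updates): V rows of d dims.
V, d, r_max = 10_000, 64, 16
emt = rng.standard_normal((V, d)).astype(np.float32)

# LoRA factors: the colocated trainer learns only A (V x r_max) and
# B (r_max x d); serving uses emt + A @ B, so the full-size base table
# never has to be re-synchronized across clusters. Random init here is
# only so the sketch has a nontrivial spectrum to prune.
A = (0.01 * rng.standard_normal((V, r_max))).astype(np.float32)
B = (0.01 * rng.standard_normal((r_max, d))).astype(np.float32)

def prune_update(A, B, energy=0.99):
    """Rotate the factors into the SVD basis of the delta A @ B, then keep
    the smallest rank retaining `energy` of the spectral mass. Thin QR keeps
    the cost at O(r^3) instead of an SVD of the full V x d delta."""
    qa, ra = np.linalg.qr(A)               # A = qa @ ra,     qa: V x r
    qb, rb = np.linalg.qr(B.T)             # B = rb.T @ qb.T, qb: d x r
    u, s, vt = np.linalg.svd(ra @ rb.T)    # delta = qa @ u @ diag(s) @ vt @ qb.T
    cum = np.cumsum(s**2) / np.sum(s**2)
    k = int(np.searchsorted(cum, energy) + 1)
    return (qa @ u[:, :k]) * s[:k], vt[:k] @ qb.T   # pruned A: V x k, B: k x d

def lookup(ids, A, B):
    """Serve embeddings with the (pruned) low-rank correction applied."""
    return emt[ids] + A[ids] @ B

A, B = prune_update(A, B)
print("active rank:", A.shape[1])          # <= r_max after pruning
print(lookup(np.array([3, 7, 42]), A, B).shape)
```

Truncating in the SVD basis means the kept factors form the best rank-k approximation of the update, so memory shrinks with minimal loss of the learned correction.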
📝 Abstract
Deep Learning Recommendation Models (DLRMs) underpin personalized services but face a critical freshness-accuracy tradeoff due to massive parameter synchronization overheads. Production DLRMs deploy decoupled training/inference clusters, where synchronizing petabyte-scale embedding tables (EMTs) causes multi-minute staleness, degrading recommendation quality and revenue. We observe that (1) inference nodes exhibit sustained CPU underutilization (peak ≤20%), and (2) EMT gradients possess intrinsic low-rank structure, enabling compact update representation. We present LiveUpdate, a system that eliminates inter-cluster synchronization by colocating Low-Rank Adaptation (LoRA) trainers within inference nodes. LiveUpdate addresses two core challenges: (1) dynamic rank adaptation via singular value monitoring to constrain memory overhead (<2% of EMTs), and (2) NUMA-aware resource scheduling with hardware-enforced QoS to eliminate update/inference contention (P99 latency impact <20 ms). Evaluations show LiveUpdate reduces update costs by 2× versus delta-update baselines while achieving higher accuracy within 1-hour windows. By transforming idle inference resources into freshness engines, LiveUpdate delivers online model updates while outperforming state-of-the-art delta-update methods by 0.04–0.24% in accuracy.
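As a rough illustration of the NUMA-aware isolation idea, the sketch below pins the colocated trainer to one set of cores and serving to another using Linux CPU affinity. The even/odd core split is an assumption standing in for real NUMA topology discovery, and the hardware-enforced QoS layer (e.g., cache/bandwidth partitioning) is only noted in comments; this is not the paper's actual scheduler.

```python
import os
import multiprocessing as mp

# Illustrative topology: assume a two-socket box where NUMA node 0 owns the
# even cores and node 1 the odd cores. A real deployment would read the
# topology from /sys/devices/system/node (or libnuma) instead of hardcoding.
ALL_CPUS = set(range(os.cpu_count() or 1))
INFERENCE_CPUS = {c for c in ALL_CPUS if c % 2 == 0}   # node 0: serving
TRAINER_CPUS = ALL_CPUS - INFERENCE_CPUS or ALL_CPUS   # node 1: LoRA updates

def lora_trainer():
    # Confine the colocated trainer to its own NUMA node so its memory
    # traffic and cache footprint stay off the serving socket. Hardware QoS
    # (e.g., cache/bandwidth partitioning via Linux resctrl) would be layered
    # on top; CPU affinity is just the simplest first line of isolation.
    os.sched_setaffinity(0, TRAINER_CPUS)   # Linux-only call
    ...  # consume fresh samples and update the low-rank factors

if __name__ == "__main__":
    os.sched_setaffinity(0, INFERENCE_CPUS)             # keep serving on node 0
    mp.Process(target=lora_trainer, daemon=True).start()
```

Keeping the trainer's working set local to its own socket is what bounds the P99 latency impact: inference threads never compete with update threads for cores, last-level cache, or memory bandwidth on the serving node.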