Disaggregating Embedding Recommendation Systems with FlexEMR

📅 2024-09-28

🏛️ arXiv.org

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Embedding-based recommendation (EMR) models have large and growing memory footprints, and serving them on monolithic servers with fixed ratios of GPUs, CPUs, and DRAM leads to low resource utilization and higher cost. Disaggregating embedding lookups from neural network inference addresses this but adds network overhead. This paper proposes FlexEMR, an architecture for optimized EMR disaggregation. Its key contributions are: (1) a mechanism that exploits the temporal and spatial locality of embedding lookups to reduce redundant data movement over the network; (2) a multi-threaded RDMA engine that serves concurrent lookup subrequests; and (3) a joint scheduling strategy combining embedding table sharding with dynamic caching. Early prototype evaluation reports substantially reduced network traffic, a 2.3× improvement in embedding lookup throughput, and a 37% reduction in end-to-end latency, with model accuracy preserved.

📝 Abstract

Efficiently serving embedding-based recommendation (EMR) models remains a significant challenge due to their increasingly large memory requirements. Today's practice splits the model across many monolithic servers, where a mix of GPUs, CPUs, and DRAM is provisioned in fixed proportions. This approach leads to suboptimal resource utilization and increased costs. Disaggregating embedding operations from neural network inference is a promising solution but raises novel networking challenges. In this paper, we discuss the design of FlexEMR for optimized EMR disaggregation. FlexEMR proposes two sets of techniques to tackle the networking challenges: Leveraging the temporal and spatial locality of embedding lookups to reduce data movement over the network, and designing an optimized multi-threaded RDMA engine for concurrent lookup subrequests. We outline the design space for each technique and present initial results from our early prototype.
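The two ideas named in the abstract can be illustrated with a small sketch. This is not FlexEMR's implementation: it is a toy model, assuming an in-process `EmbeddingShard` class stands in for a remote embedding server, a thread pool stands in for the multi-threaded RDMA engine, and an LRU cache stands in for locality-aware reuse on the inference side. All class names, the modulo sharding rule, and the embedding width are hypothetical.

```python
from collections import OrderedDict
from concurrent.futures import ThreadPoolExecutor

DIM = 4  # hypothetical embedding width, chosen only for illustration


class EmbeddingShard:
    """Stand-in for a remote embedding server holding one table shard."""

    def __init__(self, shard_id):
        self.shard_id = shard_id
        self.requests = 0  # number of subrequests this shard has served

    def lookup(self, ids):
        self.requests += 1
        # Deterministic fake vectors so results are easy to check.
        return {i: [float(i)] * DIM for i in ids}


class CachedLookupClient:
    """Inference-side client: LRU cache plus concurrent per-shard subrequests."""

    def __init__(self, shards, cache_size=1024, workers=4):
        self.shards = shards
        self.cache = OrderedDict()  # embedding id -> vector, in LRU order
        self.cache_size = cache_size
        self.pool = ThreadPoolExecutor(max_workers=workers)
        self.remote_ids = 0  # ids that had to be fetched over the "network"

    def _shard_of(self, emb_id):
        # Assumed sharding rule: simple modulo placement.
        return emb_id % len(self.shards)

    def lookup(self, ids):
        result = {}
        misses = {}  # shard index -> ids missing from the cache
        for i in set(ids):  # duplicates within a request are fetched once
            if i in self.cache:
                self.cache.move_to_end(i)  # mark as recently used
                result[i] = self.cache[i]
            else:
                misses.setdefault(self._shard_of(i), []).append(i)

        # One subrequest per shard, issued concurrently.
        futures = [
            self.pool.submit(self.shards[s].lookup, batch)
            for s, batch in misses.items()
        ]
        for fut in futures:
            for i, vec in fut.result().items():
                result[i] = vec
                self.cache[i] = vec
                self.remote_ids += 1
                if len(self.cache) > self.cache_size:
                    self.cache.popitem(last=False)  # evict least recently used
        return [result[i] for i in ids]
```

In this sketch, a second request that repeats recently seen ids only contacts the shards that own the uncached ids, which is the kind of data-movement saving that lookup locality makes possible. A real system would replace the thread pool with RDMA reads and would need a cache policy tuned to the workload.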

Problem

Research questions and friction points this paper is trying to address.

Resource Allocation

Network Efficiency

Cost Optimization

Innovation

Methods, ideas, or system contributions that make the work stand out.

FlexEMR

Network Optimization

Resource Allocation Efficiency

🔎 Similar Papers

Long-Sequence Recommendation Models Need Decoupled Embeddings