🤖 AI Summary
In deep learning recommendation systems, sparse and irregular memory accesses to embedding layers cause severe memory-bandwidth bottlenecks and high inter-chip communication overhead, critically limiting large-scale single-node inference. Targeting the distinctive access patterns of weight-shared embedding layers, this work proposes the first cache-enhanced prefetching and subtable-mapping co-optimization framework tailored to Processing-in-Memory (PIM) architectures. It identifies and exploits the fine-grained memory locality induced by weight sharing, enabling low-overhead on-PIM embedding caching and accurate prefetching that together eliminate redundant CPU-PIM data transfers. Experiments show a 4.8x inference speedup over state-of-the-art PIM-based approaches, along with substantial gains in throughput and energy efficiency. The result is an efficient, scalable hardware-algorithm co-design paradigm for large-scale sparse-model inference.
📝 Abstract
The growing model size of personalized recommendation systems poses new challenges for inference. Weight-sharing algorithms have been proposed to reduce model size, but they increase memory accesses. Recent advances in processing-in-memory (PIM) have improved model throughput by exploiting memory-level parallelism, but weight-sharing algorithms introduce massive CPU-PIM communication into prior PIM systems. We propose ProactivePIM, a PIM system that accelerates weight-sharing recommendation systems. ProactivePIM integrates a cache within the PIM together with a prefetching scheme to exploit the algorithm's unique locality, and it eliminates communication overhead through a subtable mapping strategy. ProactivePIM achieves a 4.8x speedup over prior work.
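To make the core idea concrete, the sketch below is a hypothetical software model (not the paper's actual implementation) of a weight-shared embedding layer. It assumes a quotient/remainder-style sharing scheme in which many rows reuse a small shared subtable; a small LRU cache stands in for the on-PIM cache, and `prefetch` warms it with the next batch's shared rows so the subsequent lookups avoid "transfers" (cache misses). Class and method names are illustrative assumptions.

```python
# Illustrative model of cached weight-shared embedding lookup with prefetch.
# The quotient subtable is shared by many rows, so its entries exhibit
# high reuse locality -- the property an on-PIM cache can exploit.
from collections import OrderedDict
import numpy as np

class CachedSharedEmbedding:
    def __init__(self, num_rows, dim, num_shared, cache_size, seed=0):
        rng = np.random.default_rng(seed)
        self.quotient = rng.standard_normal((num_shared, dim))        # shared subtable
        self.remainder = rng.standard_normal((num_rows // num_shared + 1, dim))
        self.num_shared = num_shared
        self.cache = OrderedDict()   # LRU cache over shared-subtable rows
        self.cache_size = cache_size
        self.hits = self.misses = 0

    def _shared_row(self, idx):
        q = idx % self.num_shared
        if q in self.cache:          # cache hit: no CPU-PIM transfer needed
            self.cache.move_to_end(q)
            self.hits += 1
        else:                        # miss: fetch the row, evict LRU if full
            self.misses += 1
            self.cache[q] = self.quotient[q]
            if len(self.cache) > self.cache_size:
                self.cache.popitem(last=False)
        return self.cache[q]

    def prefetch(self, indices):
        # Warm the cache with the upcoming batch's shared rows ahead of time.
        for idx in indices:
            self._shared_row(idx)

    def lookup(self, indices):
        # Each embedding is reconstructed from a shared row plus a per-group row.
        return np.stack([self._shared_row(i) + self.remainder[i // self.num_shared]
                         for i in indices])
```

With the cache prefetched for a batch, every shared-row access in the actual lookup hits the cache; only the misses counted during `prefetch` correspond to data movement, which is the communication the real system overlaps with computation.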