🤖 AI Summary
Problem: Existing CPU-centric approximate nearest neighbor (ANN) indexing and filtering solutions for DLRM-based large-scale recommendation systems incur high computational overhead and cannot jointly optimize indexing and model inference, which leads to suboptimal GPU serving efficiency.
Method: We propose the first end-to-end GPU-native recommendation serving system featuring: (1) a GPU-native Bloom index coupled with an integrated Int8 nearest-neighbor search kernel, reducing memory footprint via dual-index collaboration; (2) an OverArch scoring layer and Value Model enabling multi-task retrieval and similarity learning; and (3) unified embedding caching, model-aware indexing, and joint modeling for fully model-driven serving.
Results: Evaluated on industrial datasets, our system achieves up to 5.6× lower latency and 23.7× higher throughput than state-of-the-art approaches, is 13.35× more cost-efficient than CPU-based solutions, and currently serves hundreds of models for billions of users.
📝 Abstract
Serving deep learning-based recommendation models (DLRMs) at scale is challenging. Existing systems rely on CPU-based ANN indexing and filtering services, which incur non-negligible costs and forgo joint optimization opportunities. This inefficiency makes it difficult for them to support more complex model architectures, such as learned similarities and multi-task retrieval. In this paper, we propose SilverTorch, a model-based system for serving recommendation models on GPUs. SilverTorch unifies model serving by replacing standalone indexing and filtering services with layers of served models. We propose a Bloom index algorithm on GPUs for feature filtering and a tensor-native fused Int8 ANN kernel on GPUs for nearest neighbor search. We further co-design the ANN search index and the filtering index to reduce GPU memory utilization and eliminate unnecessary computation. Building on SilverTorch's serving paradigm, we introduce an OverArch scoring layer and a Value Model to aggregate results across multiple tasks. These advancements improve retrieval accuracy and enable future studies on serving more complex models. For ranking, SilverTorch's design accelerates item embedding calculation by caching the pre-calculated embeddings inside the serving model. Our evaluation on industry-scale datasets shows that SilverTorch achieves up to 5.6x lower latency and 23.7x higher throughput compared to state-of-the-art approaches. We also demonstrate that SilverTorch is 13.35x more cost-efficient than a CPU-based solution while improving accuracy by serving more complex models. SilverTorch serves hundreds of models online across major products and recommends content for billions of daily active users.
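To make the idea of co-designed filtering and Int8 search concrete, the following is a minimal CPU sketch in plain Python. It is not SilverTorch's GPU kernel or its actual Bloom index layout, which the abstract does not specify. The names `bloom_signature`, `quantize_int8`, `build_index`, and `filtered_int8_search`, the 64-bit signature width, the two hash functions, and the symmetric per-vector quantization scheme are all assumptions chosen for illustration. The sketch shows how a Bloom-style bitmask test can reject candidates before any similarity computation, so that filtered-out items cost no dot product.

```python
import hashlib

NUM_BITS = 64    # signature width (hypothetical choice)
NUM_HASHES = 2   # hash functions per feature (hypothetical choice)


def bloom_signature(features):
    """Fold a set of feature strings into one NUM_BITS-wide Bloom bitmask."""
    sig = 0
    for feat in features:
        digest = hashlib.sha256(feat.encode("utf-8")).digest()
        for i in range(NUM_HASHES):
            # Use a distinct 4-byte slice of the digest per hash function.
            h = int.from_bytes(digest[4 * i: 4 * i + 4], "little")
            sig |= 1 << (h % NUM_BITS)
    return sig


def quantize_int8(vec):
    """Symmetric per-vector quantization to the int8 range [-127, 127]."""
    max_abs = max(abs(x) for x in vec) or 1.0  # avoid a zero scale
    scale = max_abs / 127.0
    return [int(round(x / scale)) for x in vec], scale


def build_index(catalog):
    """Map item_id -> (bloom signature, int8 embedding, dequantization scale)."""
    index = {}
    for item_id, (features, embedding) in catalog.items():
        v_int8, v_scale = quantize_int8(embedding)
        index[item_id] = (bloom_signature(features), v_int8, v_scale)
    return index


def filtered_int8_search(query, required_features, index, k):
    """Top-k (score, item_id) pairs among items that pass the Bloom test."""
    q_int8, q_scale = quantize_int8(query)
    required = bloom_signature(required_features)
    scored = []
    for item_id, (sig, v_int8, v_scale) in index.items():
        # Items missing any required bit are skipped before scoring.
        # Bloom tests can yield false positives but never false negatives.
        if sig & required != required:
            continue
        dot = sum(a * b for a, b in zip(q_int8, v_int8))  # integer accumulate
        scored.append((dot * q_scale * v_scale, item_id))
    scored.sort(reverse=True)
    return scored[:k]
```

A usage example under the same assumptions: build an index from a small catalog with `build_index({"a": ({"lang:en"}, [1.0, 0.0]), "b": ({"lang:en"}, [0.0, 1.0]), "c": (set(), [1.0, 0.0])})`, then call `filtered_int8_search([1.0, 0.0], {"lang:en"}, index, k=2)`. Item `c` carries no features, so its signature is zero and it is rejected by the bitmask test without being scored. A GPU implementation would evaluate the mask test and the integer dot products in parallel inside one fused kernel; this loop only illustrates the order of operations.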