SilverTorch: A Unified Model-based System to Democratize Large-Scale Recommendation on GPUs

📅 2025-11-18

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Problem: Existing CPU-centric approximate nearest neighbor (ANN) indexing and filtering services for DLRM-based large-scale recommendation incur non-negligible costs and cannot be optimized jointly with model inference, which makes complex architectures such as learned similarities and multi-task retrieval hard to serve. Method: SilverTorch is a model-based GPU serving system that replaces standalone indexing and filtering services with layers of the served model. It features: (1) a GPU Bloom index for feature filtering and a tensor-native fused Int8 ANN kernel for nearest neighbor search, with the two indexes co-designed to reduce GPU memory use and avoid unnecessary computation; (2) an OverArch scoring layer and a Value Model that aggregate results across multiple retrieval tasks; and (3) caching of pre-calculated item embeddings inside the serving model to speed up ranking. Results: On industry-scale datasets, SilverTorch achieves up to 5.6× lower latency and 23.7× higher throughput than state-of-the-art approaches and is 13.35× more cost-efficient than a CPU-based solution. It currently serves hundreds of models for billions of daily active users.

📝 Abstract

Serving deep learning recommendation models (DLRM) at scale is challenging. Existing systems rely on CPU-based ANN indexing and filtering services, which incur non-negligible costs and forgo joint optimization opportunities. This inefficiency makes it difficult for them to support more complex model architectures, such as learned similarities and multi-task retrieval. In this paper, we propose SilverTorch, a model-based system for serving recommendation models on GPUs. SilverTorch unifies model serving by replacing standalone indexing and filtering services with layers of the served models. We propose a Bloom index algorithm on GPUs for feature filtering and a tensor-native fused Int8 ANN kernel on GPUs for nearest neighbor search. We further co-design the ANN search index and the filtering index to reduce GPU memory utilization and eliminate unnecessary computation. Building on SilverTorch's serving paradigm, we introduce an OverArch scoring layer and a Value Model to aggregate results across multiple tasks. These advancements improve retrieval accuracy and enable future studies on serving more complex models. For ranking, SilverTorch's design accelerates item embedding calculation by caching pre-calculated embeddings inside the serving model. Our evaluation on industry-scale datasets shows that SilverTorch achieves up to 5.6x lower latency and 23.7x higher throughput compared to state-of-the-art approaches. We also demonstrate that SilverTorch is 13.35x more cost-efficient than a CPU-based solution while improving accuracy by serving more complex models. SilverTorch serves hundreds of models online across major products and recommends content to billions of daily active users.
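
To make the idea of Bloom-based feature filtering concrete, here is a minimal CPU-side sketch in plain Python. It is not SilverTorch's GPU Bloom index algorithm, whose details the abstract does not give. It only illustrates the general principle: each item keeps a compact bit signature of its features, and a membership test can return false positives but never false negatives. The names `BloomFilter`, `build_item_filters`, and `filter_candidates`, as well as the bit and hash counts, are hypothetical and chosen for illustration.

```python
import hashlib


class BloomFilter:
    """Tiny Bloom filter backed by a Python int used as a bit array (illustrative only)."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0

    def _positions(self, key):
        # Derive num_hashes bit positions by salting the key with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.blake2b(f"{seed}:{key}".encode(), digest_size=8).digest()
            yield int.from_bytes(digest, "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits |= 1 << pos

    def might_contain(self, key):
        # True may be a false positive; False is always correct.
        return all((self.bits >> pos) & 1 for pos in self._positions(key))


def build_item_filters(item_features):
    """Build one Bloom signature per item from its feature strings."""
    filters = {}
    for item_id, features in item_features.items():
        bloom = BloomFilter()
        for feature in features:
            bloom.add(feature)
        filters[item_id] = bloom
    return filters


def filter_candidates(filters, required_feature):
    """Keep the items whose signature may contain the required feature."""
    return [item_id for item_id, bloom in filters.items()
            if bloom.might_contain(required_feature)]
```

For example, calling `filter_candidates(build_item_filters({"v1": ["lang:en"], "v2": ["lang:fr"]}), "lang:en")` always returns a list that includes `"v1"`. Because the test is probabilistic, a real system would still need an exact check wherever false positives are unacceptable.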

Problem

Research questions and friction points this paper is trying to address.

Serving large-scale deep learning recommendation models efficiently on GPUs

Eliminating CPU-based ANN indexing inefficiencies and high operational costs

Enabling complex model architectures with joint optimization opportunities

Innovation

Methods, ideas, or system contributions that make the work stand out.

Unified model-based serving system on GPUs

GPU-based Bloom index for feature filtering

Tensor-native fused Int8 ANN kernel for search
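
As a rough illustration of what Int8 nearest neighbor scoring involves, the sketch below quantizes vectors to 8-bit integers, computes integer dot products, rescales them, and keeps the top-k items. It is a brute-force, pure-Python stand-in under stated assumptions: symmetric per-vector quantization and inner-product similarity are choices made here, not details taken from the paper, and nothing in it reflects how the fused GPU kernel is implemented. The function names `quantize_int8`, `int8_dot`, and `top_k` are hypothetical.

```python
import heapq


def quantize_int8(vec):
    """Symmetric per-vector quantization to integers in [-127, 127] (assumed scheme)."""
    max_abs = max(abs(x) for x in vec) or 1.0  # avoid a zero scale for all-zero vectors
    scale = max_abs / 127.0
    quantized = [max(-127, min(127, round(x / scale))) for x in vec]
    return quantized, scale


def int8_dot(qa, qb):
    """Integer dot product of two quantized vectors."""
    return sum(a * b for a, b in zip(qa, qb))


def top_k(query, items, k):
    """Return the ids of the k items with the highest approximate inner product."""
    q_query, s_query = quantize_int8(query)
    scored = []
    for item_id, vec in items.items():
        q_item, s_item = quantize_int8(vec)
        # Rescale the integer score back to the original floating-point range.
        score = int8_dot(q_query, q_item) * s_query * s_item
        scored.append((score, item_id))
    return [item_id for _, item_id in heapq.nlargest(k, scored)]
```

In a serving setting the item vectors would be quantized once offline rather than per query; they are quantized inside `top_k` here only to keep the sketch short.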
