SilverTorch: A Unified Model-based System to Democratize Large-Scale Recommendation on GPUs

📅 2025-11-18
📈 Citations: 0
Influential: 0
🤖 AI Summary
Problem: Existing CPU-centric approximate nearest neighbor (ANN) indexing and filtering solutions for DLRM-based large-scale recommendation systems incur high computational overhead and cannot jointly optimize indexing with model inference, which leaves GPU serving inefficient. Method: The paper proposes SilverTorch, an end-to-end GPU-native recommendation serving system featuring: (1) a GPU-native Bloom index for feature filtering, co-designed with a fused Int8 ANN search kernel to reduce GPU memory usage; (2) an OverArch scoring layer and a Value Model that aggregate results across tasks, enabling multi-task retrieval and learned similarities; and (3) caching of pre-calculated item embeddings inside the serving model to speed up ranking. Results: Evaluated on industry-scale datasets, the system achieves up to 5.6× lower latency and 23.7× higher throughput than state-of-the-art approaches, is 13.35× more cost-efficient than CPU-based solutions, and serves hundreds of models for billions of daily active users.

📝 Abstract
Serving deep learning based recommendation models (DLRM) at scale is challenging. Existing systems rely on CPU-based ANN indexing and filtering services, which incur non-negligible costs and forgo joint optimization opportunities. This inefficiency makes it difficult for them to support more complex model architectures, such as learned similarities and multi-task retrieval. In this paper, we propose SilverTorch, a model-based system for serving recommendation models on GPUs. SilverTorch unifies model serving by replacing standalone indexing and filtering services with layers of served models. We propose a Bloom index algorithm on GPUs for feature filtering and a tensor-native fused Int8 ANN kernel on GPUs for nearest neighbor search. We further co-design the ANN search index and the filtering index to reduce GPU memory utilization and eliminate unnecessary computation. Benefiting from SilverTorch's serving paradigm, we introduce an OverArch scoring layer and a Value Model to aggregate results across multiple tasks. These advancements improve retrieval accuracy and enable future studies on serving more complex models. For ranking, SilverTorch's design accelerates item embedding calculation by caching the pre-calculated embeddings inside the serving model. Our evaluation on industry-scale datasets shows that SilverTorch achieves up to 5.6x lower latency and 23.7x higher throughput compared to state-of-the-art approaches. We also demonstrate that SilverTorch's solution is 13.35x more cost-efficient than a CPU-based solution while improving accuracy by serving more complex models. SilverTorch serves hundreds of models online across major products and recommends content for billions of daily active users.
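To make the idea of Bloom-style feature filtering concrete, here is a minimal CPU-side sketch in plain Python. It is not SilverTorch's GPU algorithm, whose details the abstract does not give. It only illustrates the general technique: each item gets a fixed-width bit signature, and a filter query becomes a bitwise containment test. The names `NUM_BITS`, `NUM_HASHES`, `feature_mask`, `build_signature`, and `filter_items`, and the feature strings in the usage note, are all hypothetical.

```python
import hashlib

NUM_BITS = 64    # assumed signature width per item
NUM_HASHES = 3   # assumed number of hash functions per feature value


def feature_mask(feature: str) -> int:
    """Return an integer with up to NUM_HASHES bits set for one feature value."""
    mask = 0
    for seed in range(NUM_HASHES):
        digest = hashlib.sha256(f"{seed}:{feature}".encode()).digest()
        bit = int.from_bytes(digest[:8], "big") % NUM_BITS
        mask |= 1 << bit
    return mask


def build_signature(features) -> int:
    """OR together the masks of all feature values attached to an item."""
    signature = 0
    for feature in features:
        signature |= feature_mask(feature)
    return signature


def filter_items(signatures, required_features):
    """Return indices of items that may contain every required feature.

    False positives are possible (bit collisions); false negatives are not.
    """
    query = build_signature(required_features)
    return [i for i, sig in enumerate(signatures) if (sig & query) == query]
```

For example, with `signatures = [build_signature(["lang:en", "region:us"]), build_signature(["lang:fr"]), build_signature([])]`, calling `filter_items(signatures, ["lang:en"])` always includes index 0 and never includes index 2. Because the test is a pure bitwise operation over fixed-width integers, it maps naturally onto tensor operations, which is presumably part of its appeal on GPUs.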
Problem

Research questions and friction points this paper is trying to address.

Serving large-scale deep learning recommendation models efficiently on GPUs
Eliminating CPU-based ANN indexing inefficiencies and high operational costs
Enabling complex model architectures with joint optimization opportunities
Innovation

Methods, ideas, or system contributions that make the work stand out.

Unified model-based serving system on GPUs
GPU-based Bloom index for feature filtering
Tensor-native fused Int8 ANN kernel for search
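As a rough illustration of what Int8 nearest neighbor search involves, the sketch below quantizes float embeddings to 8-bit integers with one shared scale and ranks items by integer dot product. It is a brute-force scan in plain Python, not the fused GPU kernel or the ANN index described in the paper, and the helper names `compute_scale`, `quantize`, and `search` are hypothetical.

```python
def compute_scale(vectors):
    """Pick one shared scale so the largest magnitude maps to 127."""
    peak = max(abs(x) for vec in vectors for x in vec)
    return peak / 127 if peak > 0 else 1.0


def quantize(vec, scale):
    """Map floats to integers in the symmetric int8 range [-127, 127]."""
    return [max(-127, min(127, round(x / scale))) for x in vec]


def search(query_q, items_q, k):
    """Return indices of the k items with the largest integer dot product."""
    scored = [
        (sum(q * v for q, v in zip(query_q, item)), idx)
        for idx, item in enumerate(items_q)
    ]
    # Sort by descending score; break ties by lower index for determinism.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [idx for _, idx in scored[:k]]
```

In use, the item vectors would be quantized once ahead of time, for example `items_q = [quantize(v, scale) for v in items]`, and each query quantized with the same `scale` before calling `search`. Storing one byte per dimension instead of four is the usual source of the memory savings from Int8 search, at the cost of some rounding error in the scores.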
Authors
Bi Xue (Meta Platforms)
Hong Wu (Meta Platforms)
Lei Chen (Meta Platforms)
Chao Yang (Meta Platforms)
Yiming Ma (Meta Platforms)
Fei Ding (Unknown affiliation)
Zhen Wang (Meta Platforms)
Liang Wang (Meta Platforms)
Xiaoheng Mao (Meta Platforms)
Ke Huang (Meta Platforms)
Xialu Li (Meta Platforms)
Peng Xia (PhD student, Department of Computer Science, UNC Chapel Hill; Multimodal Agent, Healthcare)
Rui Jian (Meta Platforms)
Yanli Zhao (Meta Platforms)
Yanzun Huang (Meta Platforms)
Yijie Deng (Meta Platforms)
Harry Tran (University of Minnesota; Non Invasive Brain Stimulation, multiscale modeling, single unit activity)
Ryan Chang (Meta Platforms)
Min Yu (Meta Platforms)
Eric Dong (Meta Platforms)
Jiazhou Wang (Meta Platforms)
Qianqian Zhang (Meta Platforms)
Keke Zhai (Unknown affiliation; HPC, parallel computing)
Hongzhang Yin (Meta Platforms)
P. Garbacki (Fireworks AI)
Zheng Fang (Meta Platforms)
Yiyi Pan (Meta Platforms)
Min Ni (Meta Platforms)
Yang Liu (Meta Platforms)