SOLARIS: Speculative Offloading of Latent-bAsed Representation for Inference Scaling

📅 2026-04-13
🤖 AI Summary
This work addresses the challenge of deploying computationally intensive foundation models for real-time recommendation, where high inference costs often force practitioners to sacrifice model effectiveness for efficiency. Inspired by speculative decoding, we propose the first application of speculative precomputation to online recommendation inference: by predicting future user–item interactions, the system asynchronously pre-generates representations from the foundation model, decoupling expensive computation from the latency-sensitive serving path. Our approach combines a mechanism for speculatively offloading latent representations with an embedding pre-generation strategy. Deployed in Meta’s advertising system, which handles billions of daily requests, it achieves significant computational efficiency gains while improving core revenue metrics by 0.67%.

📝 Abstract
Recent advances in recommendation scaling laws have led to foundation models of unprecedented complexity. While these models offer superior performance, their computational demands make real-time serving impractical, often forcing practitioners to rely on knowledge distillation, which compromises serving quality for efficiency. To address this challenge, we present SOLARIS (Speculative Offloading of Latent-bAsed Representation for Inference Scaling), a novel framework inspired by speculative decoding. SOLARIS proactively precomputes user-item interaction embeddings by predicting which user-item pairs are likely to appear in future requests and asynchronously generating their foundation model representations ahead of time. This approach decouples the costly foundation model inference from the latency-critical serving path, enabling real-time knowledge transfer from models previously considered too expensive for online use. Deployed across Meta's advertising system, which serves billions of daily requests, SOLARIS achieves a 0.67% gain in revenue-driving top-line metrics, demonstrating its effectiveness at scale.
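The decoupling described in the abstract can be illustrated with a toy sketch. This is not the SOLARIS implementation: the class `SpeculativeEmbeddingCache`, the function `serve`, the stand-in `toy_foundation_model`, and the fallback embedding are all hypothetical names chosen for illustration, and the precompute step is called synchronously here to stand in for what the abstract describes as an asynchronous job. The sketch only shows the shape of the idea: the expensive model runs for predicted user-item pairs ahead of time, and the serving path performs a lookup that never invokes the model.

```python
from typing import Callable, Dict, Iterable, List, Optional, Tuple

Pair = Tuple[str, str]  # (user_id, item_id); hypothetical key format
Embedding = List[float]


class SpeculativeEmbeddingCache:
    """Toy cache of expensive-model embeddings, filled off the serving path."""

    def __init__(self, expensive_model: Callable[[Pair], Embedding]) -> None:
        self._model = expensive_model
        self._cache: Dict[Pair, Embedding] = {}
        self.hits = 0
        self.misses = 0

    def precompute(self, predicted_pairs: Iterable[Pair]) -> None:
        # Off-path step: run the expensive model only for pairs that are
        # predicted to appear in future requests and are not cached yet.
        for pair in predicted_pairs:
            if pair not in self._cache:
                self._cache[pair] = self._model(pair)

    def lookup(self, pair: Pair) -> Optional[Embedding]:
        # Serving-path step: a dictionary read; the model is never called.
        emb = self._cache.get(pair)
        if emb is None:
            self.misses += 1
        else:
            self.hits += 1
        return emb


def serve(cache: SpeculativeEmbeddingCache, pair: Pair,
          fallback: Embedding) -> Embedding:
    # On a mis-speculated pair, fall back to a default embedding
    # (an assumption of this sketch) instead of blocking on the model.
    emb = cache.lookup(pair)
    return emb if emb is not None else fallback


def toy_foundation_model(pair: Pair) -> Embedding:
    # Stand-in for an expensive foundation model; purely illustrative.
    user_id, item_id = pair
    return [float(len(user_id)), float(len(item_id))]


if __name__ == "__main__":
    cache = SpeculativeEmbeddingCache(toy_foundation_model)
    cache.precompute([("user_1", "ad_a"), ("user_1", "ad_b")])
    print(serve(cache, ("user_1", "ad_a"), fallback=[0.0, 0.0]))
    print(serve(cache, ("user_2", "ad_a"), fallback=[0.0, 0.0]))
    print("hits:", cache.hits, "misses:", cache.misses)
```

In a sketch like this, the quality of the prediction step determines the hit rate: every pair that was not predicted falls through to the fallback, so the benefit of the expensive model reaches only the requests that were speculated correctly.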
Problem

Research questions and friction points this paper is trying to address.

foundation models
real-time inference
recommendation systems
computational efficiency
serving latency
Innovation

Methods, ideas, or system contributions that make the work stand out.

speculative decoding
foundation models
inference offloading
recommendation systems
real-time serving
Zikun Liu
Meta AI
Liang Luo
University of Washington
Systems for Machine Learning, Computer Systems, Computer Architecture, Machine Learning for Systems
Qianru Li
University of California Los Angeles
Mobile systems, 5G/LTE
Zhengyu Zhang
Meta AI
Wei Ling
Meta AI
Jingyi Shen
The Ohio State University
Data Visualization, Machine Learning
Zeliang Chen
Meta AI
Yaning Huang
Meta AI
Jingxian Huang
Meta AI
Abdallah Aboelela
Meta AI
Chonglin Sun
Meta AI
Feifan Gu
Meta AI
Fenggang Wu
Meta AI
Hang Qu
University of Liverpool
Huayu Li
University of Arizona
Machine learning, healthcare informatics, medical time series, digital health
Jill Pan
Meta AI
Kaidi Pei
Meta AI
Laming Chen
Facebook
Recommender System, Optimization, Compressive sensing
Longhao Jin
Meta AI
Qin Huang
Meta AI
Tongyi Tang
Meta AI
Varna Puvvada
Meta AI
Wenlin Chen
Meta Platforms
Machine Learning, Data Mining, Artificial Intelligence
Xiaohan Wei
Meta AI
Xu Cao
Meta AI