SOLARIS: Speculative Offloading of Latent-bAsed Representation for Inference Scaling

📅 2026-04-13

🤖 AI Summary

This work addresses the challenge of deploying computationally intensive foundation models for real-time recommendation, where high inference costs often force practitioners to trade model effectiveness for efficiency. Inspired by speculative decoding, the authors present what they describe as the first application of speculative precomputation to online recommendation inference: by predicting future user–item interactions, the system asynchronously pre-generates representations from the foundation model, decoupling expensive computation from the latency-sensitive serving path. The approach combines speculative offloading of latent representations with an embedding pre-generation strategy. Deployed in Meta’s advertising system, which handles billions of daily requests, it reports significant computational efficiency gains while improving core revenue metrics by 0.67%.

📝 Abstract

Recent advances in recommendation scaling laws have led to foundation models of unprecedented complexity. While these models offer superior performance, their computational demands make real-time serving impractical, often forcing practitioners to rely on knowledge distillation, which compromises serving quality for efficiency. To address this challenge, we present SOLARIS (Speculative Offloading of Latent-bAsed Representation for Inference Scaling), a novel framework inspired by speculative decoding. SOLARIS proactively precomputes user-item interaction embeddings by predicting which user-item pairs are likely to appear in future requests and asynchronously generating their foundation model representations ahead of time. This approach decouples the costly foundation model inference from the latency-critical serving path, enabling real-time knowledge transfer from models previously considered too expensive for online use. Deployed across Meta's advertising system, which serves billions of daily requests, SOLARIS achieves a 0.67% gain in revenue-driving top-line metrics, demonstrating its effectiveness at scale.
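
The decoupling described in the abstract can be illustrated with a minimal sketch. This is not the SOLARIS implementation: the names `expensive_foundation_model`, `predict_future_pairs`, `SpeculativeEmbeddingCache`, and `serve` are hypothetical, the "model" is a toy function, the speculation policy simply takes the first k candidates, and the zero-vector fallback on a cache miss is an assumption, since the abstract does not say how misses are handled.

```python
from concurrent.futures import Future, ThreadPoolExecutor
from typing import Dict, List, Optional, Tuple

Pair = Tuple[str, str]      # (user_id, item_id)
Embedding = List[float]


def expensive_foundation_model(pair: Pair) -> Embedding:
    """Toy stand-in for a costly foundation model (hypothetical)."""
    user_id, item_id = pair
    return [float(len(user_id)), float(len(item_id))]


def predict_future_pairs(user_id: str, candidates: List[str], k: int) -> List[Pair]:
    """Toy speculation policy (assumption): guess the first k candidates."""
    return [(user_id, item_id) for item_id in candidates[:k]]


class SpeculativeEmbeddingCache:
    """Precomputes embeddings off the serving path and stores them by pair."""

    def __init__(self, max_workers: int = 4) -> None:
        self._cache: Dict[Pair, Embedding] = {}
        self._pending: List[Tuple[Pair, Future]] = []
        self._pool = ThreadPoolExecutor(max_workers=max_workers)

    def precompute_async(self, pairs: List[Pair]) -> None:
        # Submit model calls to background workers; do not block the caller.
        for pair in pairs:
            future = self._pool.submit(expensive_foundation_model, pair)
            self._pending.append((pair, future))

    def drain(self) -> None:
        # Collect finished results into the cache (blocks until all are done).
        for pair, future in self._pending:
            self._cache[pair] = future.result()
        self._pending.clear()

    def lookup(self, pair: Pair) -> Optional[Embedding]:
        return self._cache.get(pair)

    def close(self) -> None:
        self._pool.shutdown(wait=True)


def serve(cache: SpeculativeEmbeddingCache, pair: Pair) -> Embedding:
    """Latency-critical path: a cache read only, never a model call."""
    hit = cache.lookup(pair)
    # Assumed fallback on a miss: a zero vector of the same width.
    return hit if hit is not None else [0.0, 0.0]
```

In this sketch, a request for a pair that was speculated correctly is answered from the cache, while a mispredicted pair falls back to the cheap default; in both cases the serving path never waits on the foundation model.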

Problem

Research questions and friction points this paper is trying to address.

foundation models

real-time inference

recommendation systems

computational efficiency

serving latency

Innovation

Methods, ideas, or system contributions that make the work stand out.

speculative decoding

foundation models

inference offloading