🤖 AI Summary
This work addresses the challenge of scaling sequential modeling in large-scale advertising recommendation systems under stringent latency constraints. The authors propose a scalable two-stage Transformer architecture: an upstream module asynchronously constructs rich user representations incorporating long-context and deep structural information, while a lightweight downstream model enables real-time inference. The study is the first to reveal that sequential modeling in recommendation systems follows a power-law scaling law analogous to that observed in large language models, and identifies semantic features as a critical prerequisite for effective scaling. Deployed at Meta as the largest user model to date, the approach achieves a 4.3% lift in conversion rates on Facebook Feed and Reels with minimal serving overhead.
📝 Abstract
We present LLaTTE (LLM-Style Latent Transformers for Temporal Events), a scalable transformer architecture for production ads recommendation. Through systematic experiments, we demonstrate that sequence modeling in recommendation systems follows predictable power-law scaling similar to LLMs. Crucially, we find that semantic features bend the scaling curve: they are a prerequisite for scaling, enabling the model to effectively utilize the capacity of deeper and longer architectures. To realize the benefits of continued scaling under strict latency constraints, we introduce a two-stage architecture that offloads the heavy computation of large, long-context models to an asynchronous upstream user model. We demonstrate that upstream improvements transfer predictably to downstream ranking tasks. Deployed as the largest user model at Meta, this multi-stage framework drives a 4.3% conversion uplift on Facebook Feed and Reels with minimal serving overhead, establishing a practical blueprint for harnessing scaling laws in industrial recommender systems.
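The "predictable power-law scaling" claim can be made concrete with a small sketch: a power law L(C) = a · C^(-b) is linear in log-log space, so fitting a line to log-loss versus log-compute recovers the exponent and allows extrapolation to larger budgets. The data and coefficients below are synthetic, chosen for illustration only, and are not taken from the paper.

```python
import numpy as np

# Synthetic (compute, loss) points following an exact power law
# L(C) = a * C^(-b) with a = 2.0, b = 0.05 (illustrative values only).
compute = np.array([1e18, 1e19, 1e20, 1e21])  # training compute (FLOPs)
loss = 2.0 * compute ** -0.05                 # synthetic loss values

# A power law is linear in log-log space: log L = log a - b * log C,
# so an ordinary least-squares line fit recovers the parameters.
slope, intercept = np.polyfit(np.log(compute), np.log(loss), 1)
b, a = -slope, np.exp(intercept)
print(f"fitted exponent b = {b:.3f}, prefactor a = {a:.3f}")
```

With clean synthetic data the fit recovers b ≈ 0.05 and a ≈ 2.0; in practice one fits such a curve to held-out loss at several model/data scales and checks whether larger runs land on the extrapolated line.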