PinFM: Foundation Model for User Activity Sequences at a Billion-scale Visual Discovery Platform

📅 2025-07-16
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the challenges of modeling user behavior sequences at a billion-scale visual discovery platform—high throughput (scoring millions of items per second), low latency, cross-application feature interaction, and effective cold-start recommendations for new items—this paper proposes PinFM, a foundation model built on a 20B+-parameter Transformer that is pretrained on extensive user activity data and then fine-tuned for specific applications. A key infrastructure optimization, the Deduplicated Cross-Attention Transformer (DCAT), enables scalable pretraining and efficient serving on massive user behavior sequences, while altering input sequences lets the model explicitly capture interactions between candidate items—including new items absent from pretraining—and historical user behaviors. In industrial deployment, these optimizations improved inference throughput by 600% and increased engagement with new items by 20%. PinFM now serves more than 500 million users across multiple recommendation scenarios.

📝 Abstract
User activity sequences have emerged as one of the most important signals in recommender systems. We present a foundational model, PinFM, for understanding user activity sequences across multiple applications at a billion-scale visual discovery platform. We pretrain a transformer model with 20B+ parameters using extensive user activity data, then fine-tune it for specific applications, efficiently coupling it with existing models. While this pretraining-and-fine-tuning approach has been popular in other domains, such as Vision and NLP, its application in industrial recommender systems presents numerous challenges. The foundational model must be scalable enough to score millions of items every second while meeting tight cost and latency constraints imposed by these systems. Additionally, it should capture the interactions between user activities and other features and handle new items that were not present during the pretraining stage. We developed innovative techniques to address these challenges. Our infrastructure and algorithmic optimizations, such as the Deduplicated Cross-Attention Transformer (DCAT), improved our throughput by 600% on Pinterest internal data. We demonstrate that PinFM can learn interactions between user sequences and candidate items by altering input sequences, leading to a 20% increase in engagement with new items. PinFM is now deployed to help improve the experience of more than a half billion users across various applications.
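The abstract names DCAT as the throughput optimization but does not describe its mechanism. A minimal numpy sketch of the deduplication idea, under the assumption (not stated in this summary) that the key/value projections of a user's activity sequence are computed once and reused across every candidate item scored for that user, rather than recomputed per candidate:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, keys, values):
    # queries: (n, d) candidate embeddings; keys/values: (m, d) user-side states
    scores = queries @ keys.T / np.sqrt(keys.shape[-1])
    return softmax(scores, axis=-1) @ values

rng = np.random.default_rng(0)
d = 16
user_seq = rng.normal(size=(100, d))    # one user's activity-sequence embeddings
candidates = rng.normal(size=(500, d))  # items to score for that same user

# Deduplicated: project the shared user sequence ONCE per request...
Wk, Wv = rng.normal(size=(d, d)), rng.normal(size=(d, d))
K, V = user_seq @ Wk, user_seq @ Wv

# ...then score all candidates in a single batched cross-attention pass.
out_dedup = cross_attention(candidates, K, V)

# Equivalent per-candidate loop (what naive serving would repeat 500 times):
out_naive = np.stack([cross_attention(c[None], K, V)[0] for c in candidates])
assert np.allclose(out_dedup, out_naive)
```

The outputs are identical; the saving comes from amortizing the user-side projections and attention state over all candidates in a ranking request, which is where the reported 600% throughput gain plausibly originates (hypothetical reconstruction, not the paper's exact implementation).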
Problem

Research questions and friction points this paper is trying to address.

Scaling transformer models for billion-scale user activity sequences
Handling new items not seen during pretraining in recommendations
Meeting strict latency and cost constraints in industrial systems
Innovation

Methods, ideas, or system contributions that make the work stand out.

Pretrained transformer model with 20B+ parameters
Deduplicated Cross-Attention Transformer (DCAT) optimization
Altering input sequences to learn item interactions
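The third bullet ("altering input sequences") is only named, not explained. One common realization—a sketch under the assumption that the candidate item is appended to the user activity sequence so self-attention can mix candidate and history, which would also cover items unseen during pretraining:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(seq):
    # Single-head self-attention with identity projections, for illustration only.
    scores = seq @ seq.T / np.sqrt(seq.shape[-1])
    return softmax(scores, axis=-1) @ seq

def candidate_aware_repr(user_seq, candidate):
    """Append the candidate embedding to the activity sequence, run attention
    over the combined sequence, and return the contextualized candidate state."""
    seq = np.concatenate([user_seq, candidate[None, :]], axis=0)
    return self_attention(seq)[-1]   # last position = candidate, now history-aware

rng = np.random.default_rng(1)
user_seq = rng.normal(size=(50, 8))   # 50 past activities, dim 8
new_item = rng.normal(size=(8,))      # a cold-start item, unseen in pretraining
rep = candidate_aware_repr(user_seq, new_item)
print(rep.shape)  # (8,)
```

Because the candidate's representation is built from attention over the user's actual history rather than from a learned item ID alone, new items get a meaningful score on day one—consistent with the reported 20% engagement lift for new items, though the exact sequence-alteration scheme is an assumption here.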