🤖 AI Summary
To address the sparsity, cold-start, and scalability challenges of large embedding tables in Pinterest's ad ranking system, where training the tables from scratch yielded only neutral metrics, this paper proposes a multi-faceted large-embedding-table framework. First, it introduces a multi-objective pretraining scheme that integrates contrastive learning, masked reconstruction, and graph-based collaborative modeling to enrich the embeddings beyond what end-to-end training alone achieves. Second, it designs a CPU-GPU hybrid inference architecture that overcomes GPU memory limits while keeping end-to-end latency neutral. Together, these enable efficient training and real-time serving of high-dimensional sparse features. Online A/B experiments demonstrate a 2.60% lift in CTR and a 1.34% reduction in CPC, and the framework has been fully deployed in Pinterest's production advertising system.
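The multi-objective pretraining described above combines several self-supervised signals into one weighted loss. A minimal numpy sketch, assuming an InfoNCE-style contrastive term, an MSE masked-reconstruction term, and a neighbor-agreement graph term (the specific loss forms, weights, and function names here are illustrative assumptions, not the paper's specification):

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-8):
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def contrastive_loss(anchor, positive, temperature=0.1):
    """InfoNCE-style loss: row i's positive is positive[i]; all other
    rows in the batch serve as in-batch negatives."""
    a, p = l2_normalize(anchor), l2_normalize(positive)
    logits = a @ p.T / temperature                 # (B, B) similarity matrix
    logits -= logits.max(axis=1, keepdims=True)    # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))

def masked_reconstruction_loss(emb, recon, mask):
    """MSE on the masked rows only (mask is a per-row 0/1 vector)."""
    diff = (emb - recon) ** 2
    return (diff * mask[:, None]).sum() / (mask.sum() * emb.shape[1] + 1e-8)

def graph_collaborative_loss(emb, neighbor_emb):
    """Pull each entity toward the aggregate embedding of its graph neighbors."""
    return np.mean((emb - neighbor_emb) ** 2)

def multi_objective_loss(batch, weights=(1.0, 1.0, 1.0)):
    """Weighted sum of the three pretraining objectives."""
    w_c, w_m, w_g = weights
    return (w_c * contrastive_loss(batch["anchor"], batch["positive"])
            + w_m * masked_reconstruction_loss(batch["emb"], batch["recon"], batch["mask"])
            + w_g * graph_collaborative_loss(batch["emb"], batch["neighbor_emb"]))
```

In practice each term would be computed by a trainable encoder and the weights tuned; the sketch only shows how the three signals combine into a single pretraining objective.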
📝 Abstract
Large embedding tables are indispensable in modern recommendation systems, thanks to their ability to capture and memorize intricate details of interactions among diverse entities. As we explored integrating large embedding tables into Pinterest's ads ranking models, we encountered not only common challenges such as sparsity and scalability, but also several obstacles unique to our context. Notably, our initial attempts to train large embedding tables from scratch resulted in neutral metrics. To tackle this, we introduced a novel multi-faceted pretraining scheme that incorporates multiple pretraining algorithms. This approach greatly enriched the embedding tables and led to significant performance improvements: the multi-faceted large embedding tables bring substantial gains in both the Click-Through Rate (CTR) and Conversion Rate (CVR) domains. Moreover, we designed a CPU-GPU hybrid serving infrastructure to overcome GPU memory limits and improve scalability. The framework has been deployed in the Pinterest Ads system and achieved a 1.34% online CPC reduction and a 2.60% CTR increase with neutral end-to-end latency.
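The CPU-GPU hybrid serving idea can be illustrated schematically: the full embedding table stays in host (CPU) memory because it exceeds GPU memory, and per request only the handful of gathered rows crosses to the GPU-resident dense model. A minimal Python sketch, with the GPU stage stood in for by a plain function (class and method names are hypothetical, not Pinterest's actual API):

```python
import numpy as np

class HybridEmbeddingServer:
    """Toy model of CPU-GPU hybrid serving: the large sparse table lives
    in host memory; per request, only the looked-up rows are handed to
    the dense model, which in production would run on GPU."""

    def __init__(self, num_ids, dim, seed=0):
        rng = np.random.default_rng(seed)
        # Large sparse table: kept on CPU because it exceeds GPU memory.
        self.table = rng.normal(size=(num_ids, dim)).astype(np.float32)
        # Small dense head: stand-in for the GPU-resident ranking model.
        self.dense_w = rng.normal(size=(dim,)).astype(np.float32)

    def lookup_cpu(self, ids):
        """Gather only the requested rows (cheap host-side op)."""
        return self.table[np.asarray(ids)]

    def score_gpu(self, gathered):
        """Stand-in for the GPU dense model: linear scorer + sigmoid."""
        logits = gathered @ self.dense_w
        return 1.0 / (1.0 + np.exp(-logits))

    def serve(self, ids):
        # Only len(ids) * dim floats cross the CPU->GPU boundary per
        # request, never the full table.
        return self.score_gpu(self.lookup_cpu(ids))
```

A production system would additionally batch requests, cache hot rows on the device, and overlap host-to-device transfer with compute; the sketch only shows the memory split that keeps end-to-end latency neutral.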