🤖 AI Summary
To address three key challenges in Google's online advertising recommendation and bid scoring systems (inefficient input generation, bottlenecks in large-scale embedding table processing, and resource waste from interruptions in shared datacenters), this work proposes end-to-end training infrastructure optimizations. Methodologically, it introduces: (1) a shared input generation mechanism that unifies raw-data-to-numerical-feature conversion and amortizes its cost across many models; (2) a co-designed framework integrating embedding table partitioning, pipelined execution, and RPC coalescing; and (3) preemption notice and training hold mechanisms enabling training suspension and rapid resumption. The system incorporates TPU acceleration, sparse-to-dense feature conversion, distributed RPC optimization, and streaming data injection. Evaluated in production, the solution achieves a 116% increase in training throughput and an 18% reduction in unit cost, demonstrating significant scalability and efficiency gains for large-scale ad ranking models.
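The shared input generation idea (1) can be illustrated with a toy sketch: raw fields are converted to numerical features once, and the materialized result fans out to several model consumers rather than each model re-running its own conversion pipeline. All names here (`SharedInputPipeline`, `generate_features`, `vocab_hash`, the two toy models) are hypothetical illustrations, not APIs from the paper.

```python
# Hypothetical sketch: one shared input-generation pass feeds many models,
# amortizing the raw-data -> numerical-feature conversion cost.

def vocab_hash(token, num_buckets=1000):
    # Toy stand-in for a real vocabulary/hashing transform.
    return hash(token) % num_buckets

def generate_features(raw_example):
    # Convert raw fields (e.g. a search query) into numerical features ONCE.
    tokens = raw_example["query"].split()
    return {
        "query_ids": [vocab_hash(t) for t in tokens],
        "num_tokens": len(tokens),
    }

class SharedInputPipeline:
    """Runs feature generation once and fans the result out to N models."""

    def __init__(self, model_consumers):
        self.model_consumers = model_consumers

    def process(self, raw_example):
        features = generate_features(raw_example)  # computed once
        # Every model reads the same materialized features instead of
        # re-running its own conversion pipeline on the raw data.
        return [consumer(features) for consumer in self.model_consumers]

# Two toy "models" that consume the shared features.
ctr_model = lambda f: ("ctr", f["num_tokens"])
bid_model = lambda f: ("bid", len(f["query_ids"]))

pipeline = SharedInputPipeline([ctr_model, bid_model])
results = pipeline.process({"query": "running shoes sale"})
```

In a production setting the amortization would happen across training jobs sharing a materialized feature stream, not within one process; the sketch only shows the cost-sharing structure.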
📝 Abstract
Large-scale Ads recommendation and auction scoring models at Google demand immense computational resources. While specialized hardware such as TPUs has accelerated linear algebra computations, bottlenecks persist in large-scale systems. This paper proposes solutions for three critical challenges that must be addressed for efficient end-to-end execution in a widely used production infrastructure: (1) Input Generation and Ingestion Pipeline: efficiently transforming raw features (e.g., "search query") into numerical inputs and streaming them to TPUs; (2) Large Embedding Tables: optimizing the conversion of sparse features into dense floating-point vectors for neural network consumption; (3) Interruptions and Error Handling: minimizing resource wastage in large-scale shared datacenters. To tackle these challenges, we propose a shared input generation technique that reduces the computational load of input generation by amortizing costs across many models. Furthermore, we propose partitioning, pipelining, and RPC (Remote Procedure Call) coalescing software techniques to optimize embedding operations. To maintain efficiency at scale, we describe novel preemption notice and training hold mechanisms that minimize resource wastage and ensure prompt error resolution. These techniques have demonstrated significant improvements in Google production, achieving a 116% performance boost and an 18% reduction in training costs across representative models.
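To make the RPC coalescing idea concrete, here is a minimal sketch under stated assumptions: embedding tables are partitioned across shards, and instead of issuing one lookup RPC per sparse ID, the client groups IDs by owning shard and sends a single batched request per shard. The shard layout, `batched_lookup_rpc`, and `coalesced_lookup` are illustrative inventions, not the paper's actual interfaces.

```python
from collections import defaultdict

# Hypothetical sketch of RPC coalescing for a partitioned embedding table:
# group sparse IDs by the shard that owns them, then issue ONE batched
# lookup RPC per shard instead of one RPC per ID.

NUM_SHARDS = 4

# Toy in-memory "shards": each maps an embedding ID to a small dense vector.
SHARDS = {s: {} for s in range(NUM_SHARDS)}
for i in range(100):
    SHARDS[i % NUM_SHARDS][i] = [float(i)] * 2  # 2-dim embedding

def shard_of(embedding_id):
    # Modulo sharding: which server owns this ID.
    return embedding_id % NUM_SHARDS

def batched_lookup_rpc(shard, ids):
    # Stand-in for a single network RPC carrying many IDs at once.
    return {i: SHARDS[shard][i] for i in ids}

def coalesced_lookup(ids):
    # Bucket the sparse IDs by owning shard.
    by_shard = defaultdict(list)
    for i in ids:
        by_shard[shard_of(i)].append(i)
    # One RPC per shard, regardless of how many IDs that shard serves.
    results = {}
    for shard, shard_ids in by_shard.items():
        results.update(batched_lookup_rpc(shard, shard_ids))
    # Return dense vectors in the original request order.
    return [results[i] for i in ids]

vectors = coalesced_lookup([3, 7, 11, 2])
```

The payoff is that per-RPC overhead (serialization, network round trips) is paid once per shard rather than once per sparse feature, which is what makes large embedding lookups feasible at batch scale.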