On the Practice of Scaling Search Conversion Rate Prediction

📅 2026-05-27

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the challenge of scaling search conversion rate (CVR) prediction models in high-traffic settings, where model quality must be traded off against training cost and serving latency. An empirical study finds that the gains from scaling the backbone architecture, the embedding parameter size, and the training data volume are largely independent and additive. The authors also propose a streamlined warm-start strategy to speed up training iterations, and apply inference optimizations (decoupled graph execution and dynamic batching) to enable low-latency GPU serving of high-capacity models. Relative to the pre-scaling production baseline, the deployed model is trained on 2.5× larger training data and uses 8× more inference compute with minimal latency impact, and online A/B tests show a combined +2.6% gain in a key search CVR metric.
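The additivity finding suggests a simple sanity check when exploring scaling options: measure the gain from scaling each factor alone, sum those gains, and compare the sum with the gain measured when the factors are scaled together. The sketch below illustrates that arithmetic only. The function names and every number in it are made-up placeholders, not results or code from the paper.

```python
# Illustrative check of "independent and additive" scaling effects.
# All gains below are placeholder values (arbitrary units), NOT paper results.

def additive_prediction(single_gains, factors):
    """Predicted joint gain if the effects of `factors` simply add up."""
    return sum(single_gains[f] for f in factors)


def interaction_residual(single_gains, measured_joint_gain, factors):
    """Measured joint gain minus the additive prediction.

    A residual near zero is consistent with additive effects; a large
    residual points to an interaction between the scaled factors.
    """
    return measured_joint_gain - additive_prediction(single_gains, factors)


# Hypothetical gains from scaling one factor at a time.
single_gains = {"backbone": 0.30, "embedding": 0.20, "data": 0.40}

# Hypothetical gain measured when backbone and embedding are scaled together.
measured_backbone_embedding = 0.49

predicted_all = additive_prediction(
    single_gains, ("backbone", "embedding", "data")
)
residual = interaction_residual(
    single_gains, measured_backbone_embedding, ("backbone", "embedding")
)
```

If residuals stay small across the factor pairs that are tested, each factor can be tuned in a separate, cheaper sweep instead of a full grid over all combinations, which is one way to read the "more efficient scaling exploration" implication.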

📝 Abstract

Scaling a Search Conversion Rate (CVR) prediction model, especially in high-traffic environments, presents a challenge: superior model quality needs to be balanced with strict constraints on training cost and serving latency. This paper details an effective approach for scaling modern search CVR prediction models. We begin with an empirical study to understand the scaling performance of search CVR models, analyzing how quality improves as we scale three key factors: model backbone computation, the size of embedding parameters, and the volume of training data. We use a large-scale production dataset, comprising over a year of customer interaction logs from a high-traffic e-commerce platform, to evaluate the scalability of several state-of-the-art architectures and their ensembles. Our key findings are: (1) selecting the right backbone and scaling factors is crucial; (2) the impact of scaling backbone, embedding, and data is largely independent and additive, which has implications for more efficient scaling exploration; (3) a streamlined warm-start strategy can accelerate training iterations while simplifying new updates; (4) inference optimization strategies such as decoupled graph execution and dynamic batching can enable low-latency GPU serving even for high-capacity models. Compared to a baseline pre-scaling production model, we ultimately deployed a model trained on 2.5x larger training data with 8x more inference compute, with minimal latency impact. Online A/B tests also demonstrate that our launches achieved a combined +2.6% gain in a key metric of search conversion rate.
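The abstract names dynamic batching as one of the serving optimizations but does not describe how it is implemented. As a generic illustration of the technique, and not the authors' system, the sketch below groups queued requests into one batch until either a size cap or a wait deadline is reached, so that a single model call can score many requests. The names `collect_batch`, `max_batch_size`, and `max_wait_s` are hypothetical.

```python
import queue
import time


def collect_batch(requests, max_batch_size, max_wait_s):
    """Gather requests from a queue.Queue into one batch.

    Blocks until the first request arrives, then keeps adding requests
    until the batch holds max_batch_size items or max_wait_s seconds
    have passed since the first request was taken.
    """
    batch = [requests.get()]  # wait for at least one request
    deadline = time.monotonic() + max_wait_s
    while len(batch) < max_batch_size:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break  # waited long enough; serve what we have
        try:
            batch.append(requests.get(timeout=remaining))
        except queue.Empty:
            break  # no more requests arrived before the deadline
    return batch


def score_batch(batch):
    """Placeholder for one batched forward pass on the GPU."""
    return [0.0 for _ in batch]


# Usage: five pending requests, batches capped at four.
pending = queue.Queue()
for request_id in range(5):
    pending.put(request_id)

first = collect_batch(pending, max_batch_size=4, max_wait_s=0.01)
second = collect_batch(pending, max_batch_size=4, max_wait_s=0.01)
```

The two parameters expose the usual trade-off: a larger `max_batch_size` improves GPU utilization, while a smaller `max_wait_s` bounds the extra latency any single request can incur while waiting for the batch to fill.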

Problem

Research questions and friction points this paper is trying to address.

Search Conversion Rate

Model Scaling

Serving Latency

Training Cost

Large-scale Prediction

Innovation

Methods, ideas, or system contributions that make the work stand out.

Search Conversion Rate Prediction

Model Scaling

Inference Optimization