AI Summary
This work addresses the challenge of simultaneously achieving high throughput, low latency, and strong retrieval quality in sponsored search. To this end, the authors propose a production-oriented three-stage training paradigm: first, a billion-parameter small language model (SLM) is fine-tuned as a teacher retriever; second, its knowledge is distilled into a lightweight student encoder with fewer than 600 million parameters by aligning query representations under an L2 objective; and third, the student is further refined through contrastive learning. The study systematically investigates key design factors, including alignment objectives, embedding dimensionality, model scale, architecture, and optimization strategies. Evaluated on real-world Bing Ads data, the approach recovers over 98% of the teacher's precision while reducing online query-encoder latency by up to 27× and increasing throughput by 20× on NVIDIA A100 GPUs. Online A/B tests with the deployed 190M-parameter model show a 1% gain in ad revenue, along with 0.6% and 0.4% increases in impressions and clicks, respectively, over the current production ensemble of retrievers.
Abstract
In the competitive landscape of sponsored search, balancing retrieval quality with production latency is a critical challenge. While large retrieval models based on Small Language Models (SLMs) such as Qwen3-Embedding-4B/8B set strong upper bounds on public benchmarks, their deployment in high-throughput, latency-sensitive environments remains impractical. In this paper, we present HARNESS-LM (HLM), a three-phase training framework for transferring the capabilities of large-scale retrievers into compact, cost-efficient models. The approach comprises: (1) training a high-performance reference ("teacher") retriever by fine-tuning a billion-parameter-scale SLM; (2) aligning query representations via an L2 objective to distill knowledge into a sub-600M parameter student encoder; and (3) applying a final contrastive refinement stage to optimize the student for retrieval performance. We also present a comprehensive empirical study of key design choices, including alignment objectives, embedding dimensionality, model scale, architecture, and optimization strategies, to identify configurations that are most effective in production settings. On a real-world Bing Ads evaluation benchmark, HLM recovers over 98% of the reference retriever's precision across multiple settings, while delivering up to 27x lower online query-encoder latency and 20x higher throughput on NVIDIA A100 GPUs. Online A/B testing on Bing Ads further shows a +1% Revenue, +0.6% Impression, and +0.4% Click uplift over the current ensemble of retrievers running in production with the deployed 190M parameter model, clearly highlighting the practical efficacy of the HLM recipe in a real-world sponsored search setting.
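The two student-side objectives named in the abstract, L2 alignment of query representations (phase 2) and contrastive refinement (phase 3), can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the in-batch-negative InfoNCE form of the contrastive loss, the cosine similarity, and the temperature value are all assumptions made for illustration, and the sketch assumes the student and teacher embeddings already share the same dimensionality.

```python
import math


def l2_alignment_loss(student, teacher):
    """Phase 2 (assumed form): mean squared L2 distance between student
    and teacher query embeddings, averaged over the batch."""
    total = 0.0
    for s, t in zip(student, teacher):
        total += sum((si - ti) ** 2 for si, ti in zip(s, t))
    return total / len(student)


def normalize(v):
    """Scale a vector to unit length (zero vectors are left unchanged)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def in_batch_contrastive_loss(queries, ads, temperature=0.05):
    """Phase 3 (assumed form): InfoNCE with in-batch negatives.
    ads[i] is the positive for queries[i]; all other ads are negatives."""
    q = [normalize(v) for v in queries]
    a = [normalize(v) for v in ads]
    loss = 0.0
    for i, qi in enumerate(q):
        # Cosine similarity of query i against every ad, scaled by temperature.
        logits = [sum(x * y for x, y in zip(qi, aj)) / temperature for aj in a]
        # Numerically stable log-sum-exp over the batch.
        m = max(logits)
        log_z = m + math.log(sum(math.exp(l - m) for l in logits))
        loss += log_z - logits[i]
    return loss / len(q)
```

In a real training loop these would be computed on batched tensors in a deep learning framework; the plain-Python form is only meant to make explicit what each objective measures. The alignment loss is zero when the student reproduces the teacher's query embeddings exactly, and the contrastive loss is small when each query is closer to its own ad than to the other ads in the batch.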