HARNESS-LM: A Three-Phase Training Recipe for Harnessing SLMs in Sponsored Search Retrieval

📅 2026-05-22

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the challenge of simultaneously achieving high throughput, low latency, and strong retrieval quality in sponsored search. The authors propose a production-oriented three-phase training recipe: first, a billion-parameter-scale small language model (SLM) is fine-tuned as a teacher retriever; second, its knowledge is distilled into a student encoder with fewer than 600 million parameters by aligning query representations with an L2 objective; and third, the student is refined through contrastive learning. The study also examines key design choices, including alignment objectives, embedding dimensionality, model scale, architecture, and optimization strategies. On a real-world Bing Ads evaluation benchmark, the approach recovers over 98% of the teacher's precision while delivering up to 27× lower online query-encoder latency and 20× higher throughput on NVIDIA A100 GPUs. Online A/B tests with the deployed 190M-parameter model show a 1% gain in revenue, along with 0.6% and 0.4% increases in impressions and clicks, respectively, over the production ensemble of retrievers.

📝 Abstract

In the competitive landscape of sponsored search, balancing retrieval quality with production latency is a critical challenge. While large retrieval models based on Small Language Models (SLMs) such as Qwen3-Embedding-4B/8B set strong upper bounds on public benchmarks, their deployment in high-throughput, latency-sensitive environments remains impractical. In this paper, we present HARNESS-LM (HLM), a three-phase training framework for transferring the capabilities of large-scale retrievers into compact, cost-efficient models. The approach comprises: (1) training a high-performance reference ("teacher") retriever by fine-tuning a billion-parameter-scale SLM; (2) aligning query representations via an L2 objective to distill knowledge into a sub-600M parameter student encoder; and (3) applying a final contrastive refinement stage to optimize the student for retrieval performance. We also present a comprehensive empirical study of key design choices, including alignment objectives, embedding dimensionality, model scale, architecture, and optimization strategies, to identify configurations that are most effective in production settings. On a real-world Bing Ads evaluation benchmark, HLM recovers over 98% of the reference retriever's precision across multiple settings, while delivering up to 27x lower online query-encoder latency and 20x higher throughput on NVIDIA A100 GPUs. Online A/B testing on Bing Ads further shows a +1% Revenue, +0.6% Impression, and +0.4% Click uplift over the current ensemble of retrievers running in production with the deployed 190M parameter model, clearly highlighting the practical efficacy of the HLM recipe in a real-world sponsored search setting.
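
As a rough illustration of phases (2) and (3), the two objectives might look like the sketch below. This is not the authors' implementation: the abstract names only "an L2 objective" and "contrastive refinement", so the in-batch InfoNCE form, the cosine similarity, the `temperature` value, the function names, and the assumption that student and teacher embeddings share a dimension are all hypothetical choices made for this example.

```python
import math


def l2_alignment_loss(student, teacher):
    """Phase 2 sketch: mean squared L2 distance between paired query embeddings.

    Assumes student[i] and teacher[i] embed the same query and have equal length.
    """
    assert len(student) == len(teacher)
    total = 0.0
    for s, t in zip(student, teacher):
        total += sum((si - ti) ** 2 for si, ti in zip(s, t))
    return total / len(student)


def _normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def info_nce_loss(queries, ads, temperature=0.05):
    """Phase 3 sketch: in-batch contrastive loss (assumed form, not from the paper).

    ads[i] is treated as the positive for queries[i]; every other ad in the
    batch serves as a negative.
    """
    q = [_normalize(v) for v in queries]
    a = [_normalize(v) for v in ads]
    total = 0.0
    for i, qi in enumerate(q):
        logits = [sum(x * y for x, y in zip(qi, aj)) / temperature for aj in a]
        # Log-sum-exp with max subtraction for numerical stability.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denom - logits[i]
    return total / len(q)
```

In this sketch, the alignment loss falls to zero when the student reproduces the teacher's query embeddings exactly, and the contrastive loss is small only when each query is closer to its own ad than to the other ads in the batch.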

Problem

Research questions and friction points this paper is trying to address.