AI Summary
This work addresses the challenge of training and evaluating ranking models for natural language search systems during cold-start scenarios, where authentic queries and relevance labels are scarce. The authors propose a seed-guided synthetic query generation approach that combines contrastive listing pairs with large language models to produce realistic, diverse query-document pairs. To make the labels more discriminative, they introduce a contrastive label generation mechanism alongside a Virtual Judge annotation strategy. The resulting end-to-end synthetic data pipeline brings the generated queries closer to real user queries: KL divergence drops to 0.66 for query length (a 7.5× improvement over InPars) and to 0.04 for attribute type distributions, the latter lower than that of the seed queries themselves (0.09). The generated evaluation examples are also harder than those of a no-seed baseline (79% vs. 97% pairwise accuracy), which gives a more discriminative signal for improving retrieval and ranking models.
Abstract
Deploying natural language search systems presents a critical cold-start challenge: no real user queries to learn linguistic patterns, and no relevance labels to train ranking models. We present a framework for generating synthetic queries and labels using large language models (LLMs), powering model training and evaluation for Airbnb's natural language search.
For query generation, we combine contrastive listing pairs from booking sessions with seed queries from user research to balance realism and diversity, enabling a cold-to-warm start transition as real user data becomes available. For label generation, we introduce contrastive generation that produces topicality labels by construction, and Virtual Judge (VJ) labeling for broader coverage.
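The idea of labels "by construction" can be illustrated with a minimal sketch. Everything here is an assumption for illustration only: the `Listing` type, the attribute sets, the prompt wording, and the 1/0 label scheme are hypothetical and are not taken from the paper. The sketch selects attributes that one listing has and the other lacks; a query written around those attributes makes the first listing topical and the second not, so no separate annotation step is needed.

```python
from dataclasses import dataclass
from typing import FrozenSet, Optional


@dataclass(frozen=True)
class Listing:
    # Hypothetical, simplified listing record (not the paper's schema).
    listing_id: str
    attributes: FrozenSet[str]


def contrastive_example(positive: Listing, negative: Listing) -> Optional[dict]:
    """Build a prompt and labels from attributes only `positive` has."""
    distinguishing = sorted(positive.attributes - negative.attributes)
    if not distinguishing:
        # The pair offers no contrast, so no label can be derived from it.
        return None
    # Illustrative prompt text; an LLM call would consume this string.
    prompt = (
        "Write a short search query from a guest who wants: "
        + ", ".join(distinguishing)
    )
    return {
        "prompt": prompt,
        "distinguishing_attributes": distinguishing,
        # Labels follow from how the query is built (assumed 1/0 scheme).
        "labels": {positive.listing_id: 1, negative.listing_id: 0},
    }
```

For example, pairing a listing with `{"pool", "wifi"}` against one with only `{"wifi"}` yields a prompt about a pool, with the first listing labeled topical and the second not.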
We compare our approach against a no-seed contrastive baseline and an InPars-style baseline. For query length, the InPars baseline produces verbose queries with KL divergence of 12.03 vs. real users; our seed-guided approach achieves 0.66, a 7.5x improvement. For attribute type distributions, our approach achieves the lowest KL divergence (0.04), outperforming even seed queries (0.09). Experiments show our approach produces harder evaluation examples than the no-seed baseline (79% vs. 97% pairwise accuracy), providing discriminative signal for model improvement. We deploy production pipelines generating synthetic examples daily for embedding-based retrieval and ranking evaluation.
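The two metrics quoted above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' evaluation code: the direction of the divergence (real distribution as the reference), the additive smoothing constant, and the treatment of ties as errors are all choices made here, since the abstract does not specify them.

```python
import math
from collections import Counter


def kl_divergence(real_values, synthetic_values, smoothing=1e-6):
    """D_KL(real || synthetic) over the union of observed categories.

    Inputs are lists of categorical values, e.g. query lengths in words
    or attribute types. Additive smoothing (an assumption) keeps the
    ratio finite when a category is missing on one side.
    """
    support = set(real_values) | set(synthetic_values)
    real_counts = Counter(real_values)
    synth_counts = Counter(synthetic_values)
    real_total = len(real_values) + smoothing * len(support)
    synth_total = len(synthetic_values) + smoothing * len(support)
    kl = 0.0
    for key in support:
        p = (real_counts[key] + smoothing) / real_total
        q = (synth_counts[key] + smoothing) / synth_total
        kl += p * math.log(p / q)
    return kl


def pairwise_accuracy(scored_pairs):
    """Fraction of (relevant_score, irrelevant_score) pairs ranked correctly.

    Ties count as errors here, which is an assumption.
    """
    correct = sum(1 for pos, neg in scored_pairs if pos > neg)
    return correct / len(scored_pairs)
```

A lower pairwise accuracy on a synthetic evaluation set means the model under test separates fewer pairs correctly, which is the sense in which the abstract calls the seed-guided examples harder.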