Bridging the Cold-Start Gap: LLM-Powered Synthetic Data Generation for Natural Language Search at Airbnb

📅 2026-05-20

🤖 AI Summary
This work addresses the challenge of training and evaluating ranking models for natural language search during cold start, when authentic queries and relevance labels are scarce. The authors propose a seed-guided synthetic query generation approach that combines contrastive listing pairs from booking sessions with seed queries from user research, using large language models to produce realistic, diverse query-listing pairs. For labels, they introduce contrastive generation, which yields topicality labels by construction, alongside Virtual Judge (VJ) labeling for broader coverage. The resulting pipeline brings synthetic data closer to real user data: KL divergence for query length is 0.66, against 12.03 for the InPars-style baseline (reported as a 7.5× improvement), and KL divergence for attribute type distributions is 0.04, lower than that of the seed queries themselves (0.09). The generated evaluation examples are also harder than those from a no-seed baseline (79% vs. 97% pairwise accuracy), giving a more discriminative signal for improving retrieval and ranking models.
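As a rough illustration of the query-length comparison described above, the sketch below computes a KL divergence between two word-count histograms. Everything here is an assumption made for illustration: the function names, whitespace tokenization, the length cap, the additive smoothing constant, and the direction of the divergence (real relative to synthetic) are not taken from the paper, which may define the metric differently.

```python
import math
from collections import Counter
from typing import List, Sequence


def length_distribution(queries: Sequence[str], max_len: int = 20) -> List[float]:
    """Histogram of query lengths in words, normalized to sum to 1.

    Lengths above max_len are clipped into the last bucket (an assumption).
    """
    if not queries:
        raise ValueError("need at least one query")
    counts = Counter(min(len(q.split()), max_len) for q in queries)
    total = len(queries)
    return [counts.get(n, 0) / total for n in range(max_len + 1)]


def kl_divergence(p: Sequence[float], q: Sequence[float], eps: float = 1e-9) -> float:
    """KL(p || q) in nats, with additive smoothing so empty buckets stay finite."""
    if len(p) != len(q):
        raise ValueError("distributions must have the same number of buckets")
    p_s = [x + eps for x in p]
    q_s = [x + eps for x in q]
    z_p, z_q = sum(p_s), sum(q_s)
    return sum((a / z_p) * math.log((a / z_p) / (b / z_q)) for a, b in zip(p_s, q_s))


# Hypothetical toy data: short "real" queries versus one verbose synthetic query.
real_queries = ["beach house with pool", "pet friendly cabin", "loft near downtown"]
synthetic_queries = ["I am looking for a spacious family home with a pool near the beach"]

real_dist = length_distribution(real_queries)
synthetic_dist = length_distribution(synthetic_queries)
print(kl_divergence(real_dist, synthetic_dist))
```

With smoothing this small, buckets that one distribution never populates dominate the result, so the absolute value depends heavily on the smoothing constant and bucket layout; only comparisons made under the same settings are meaningful.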
📝 Abstract
Deploying natural language search systems presents a critical cold-start challenge: no real user queries to learn linguistic patterns, and no relevance labels to train ranking models. We present a framework for generating synthetic queries and labels using large language models (LLMs), powering model training and evaluation for Airbnb's natural language search. For query generation, we combine contrastive listing pairs from booking sessions with seed queries from user research to balance realism and diversity, enabling a cold-to-warm start transition as real user data becomes available. For label generation, we introduce contrastive generation that produces topicality labels by construction, and Virtual Judge (VJ) labeling for broader coverage. We compare our approach against a no-seed contrastive baseline and an InPars-style baseline. For query length, the InPars baseline produces verbose queries with KL divergence of 12.03 vs. real users; our seed-guided approach achieves 0.66, a 7.5x improvement. For attribute type distributions, our approach achieves the lowest KL divergence (0.04), outperforming even seed queries (0.09). Experiments show our approach produces harder evaluation examples than the no-seed baseline (79% vs. 97% pairwise accuracy), providing discriminative signal for model improvement. We deploy production pipelines generating synthetic examples daily for embedding-based retrieval and ranking evaluation.
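To make the pairwise accuracy figures above concrete, here is a minimal sketch of one common way to compute such a metric: the fraction of (query, relevant listing, non-relevant listing) triples in which a scoring function ranks the relevant listing higher. The triple layout, the function names, and the toy token-overlap scorer are hypothetical stand-ins; the paper's evaluated models and exact metric definition are not given in this abstract.

```python
from typing import Callable, Iterable, Tuple

# (query, relevant listing text, non-relevant listing text); layout is an assumption.
Triple = Tuple[str, str, str]


def pairwise_accuracy(triples: Iterable[Triple], score: Callable[[str, str], float]) -> float:
    """Fraction of triples where the relevant listing strictly outscores the other."""
    triples = list(triples)
    if not triples:
        raise ValueError("need at least one triple")
    correct = sum(1 for query, pos, neg in triples if score(query, pos) > score(query, neg))
    return correct / len(triples)


def token_overlap(query: str, listing: str) -> float:
    """Toy scorer: share of query tokens that also appear in the listing text."""
    q_tokens = set(query.lower().split())
    l_tokens = set(listing.lower().split())
    return len(q_tokens & l_tokens) / len(q_tokens) if q_tokens else 0.0


# Hypothetical examples: the second triple is "hard" because the non-relevant
# listing shares more surface tokens with the query than the relevant one.
examples = [
    ("cabin with hot tub", "cozy cabin with private hot tub", "city loft with gym"),
    ("pet friendly beach house", "beach house dogs welcome", "beach house no pets friendly hosts"),
]

print(pairwise_accuracy(examples, token_overlap))
```

Under this reading, a lower pairwise accuracy on an evaluation set means the set contains more examples that the scorer gets wrong, which is why harder synthetic examples leave more room to distinguish between models.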
Problem

Research questions and friction points this paper is trying to address.

cold-start
natural language search
synthetic data generation
relevance labeling
query generation
Innovation

Methods, ideas, or system contributions that make the work stand out.

synthetic data generation
large language models
cold-start problem
contrastive generation
natural language search
Authors

Wendy Ran Wei, Airbnb
Hao Li, Airbnb
Weiwei Guo, Airbnb
Xiaowei Liu, Airbnb
Xueyin Chen, Airbnb
Dillon Davis, Airbnb
Malay Haldar, Airbnb (Machine Learning, Search Ranking, Recommendation)
Soumyadip Banerjee, Airbnb
Kedar Bellare, Airbnb
Huiji Gao, Airbnb (Search and Recommendation, Location Based Social Networks, Data Mining, Social Networks)
Stephanie Moyerman, Airbnb
Sanjeev Katariya, Unknown affiliation (AI/ML, Quantum Physics, Evolutionary Biology)