Don't Retrieve, Generate: Prompting LLMs for Synthetic Training Data in Dense Retrieval

📅 2025-04-20
🏛️ arXiv.org
📈 Citations: 0
✨ Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the reliance of dense retrieval training on corpus access for hard negative (HN) mining. The authors propose a fully corpus-free, end-to-end LLM pipeline: an LLM first generates a relevant query from a passage, then synthesizes a hard negative from that query text alone, without invoking BM25, cross-encoders, or the original document collection. The resulting training triples are used to fine-tune embedding models (E5-Base, GTE-Base), removing the computational and storage overhead of corpus-based mining. On several BEIR benchmark datasets, the all-LLM pipeline matches both BM25-mined and cross-encoder-mined baselines on nDCG@10, P@10, and R@100. The complete synthetic dataset is released publicly, offering a simpler pathway for training retrieval models in low-resource settings.

๐Ÿ“ Abstract
Training effective dense retrieval models often relies on hard negative (HN) examples mined from the document corpus via methods like BM25 or cross-encoders (CE), processes that can be computationally demanding and require full corpus access. This paper introduces a different approach, an end-to-end pipeline where a Large Language Model (LLM) first generates a query from a passage, and then generates a hard negative example using *only* that query text. This corpus-free negative generation contrasts with standard mining techniques. We evaluated this LLM Query → LLM HN approach against traditional LLM Query → BM25 HN and LLM Query → CE HN pipelines using E5-Base and GTE-Base models on several BEIR benchmark datasets. Our results show the proposed all-LLM pipeline achieves performance identical to both the BM25 and the computationally intensive CE baselines across nDCG@10, Precision@10, and Recall@100 metrics. This demonstrates that our corpus-free negative generation method matches the effectiveness of complex, corpus-dependent mining techniques, offering a potentially simpler and more efficient pathway for training high-performance retrievers without sacrificing results. We make the dataset, including the queries and the hard negatives for all three methods, publicly available at https://huggingface.co/collections/chungimungi/arxiv-hard-negatives-68027bbc601ff6cc8eb1f449.
Problem

Research questions and friction points this paper is trying to address.

Generating synthetic hard negatives without corpus access for dense retrieval
Using LLMs to create queries and hard negatives from passages
Evaluating corpus-free training against traditional mining methods on BEIR benchmarks
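The evaluation above reports nDCG@10 among other metrics; a minimal sketch of that metric, using the standard log2-discount formulation (not the exact BEIR evaluation code), is:

```python
import math

def dcg_at_k(relevances, k=10):
    """Discounted cumulative gain over the top-k ranked relevance grades."""
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances[:k]))

def ndcg_at_k(ranked_relevances, k=10):
    """nDCG@k: DCG of the system ranking divided by the ideal (sorted) DCG."""
    ideal = dcg_at_k(sorted(ranked_relevances, reverse=True), k)
    return dcg_at_k(ranked_relevances, k) / ideal if ideal > 0 else 0.0
```

A perfect ranking scores 1.0; pushing relevant documents down the list lowers the score.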
Innovation

Methods, ideas, or system contributions that make the work stand out.

LLM generates queries from passages for data creation
Produces hard negatives using only query text
Corpus-free pipeline matches cross-encoder performance
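Hard negatives like these are typically consumed by a contrastive (InfoNCE-style) objective when fine-tuning embedding models such as E5-Base or GTE-Base. A minimal pure-Python sketch, assuming L2-normalized embeddings so dot product equals cosine similarity; the paper's exact training loss is not specified here:

```python
import math

def info_nce_loss(q, pos, negs, temperature=0.05):
    """Contrastive loss for one (query, positive, negatives) triple.

    q, pos: embedding vectors as lists; negs: list of negative embeddings.
    The positive sits at logit index 0; the loss is softmax cross-entropy.
    """
    dot = lambda a, b: sum(x * y for x, y in zip(a, b))
    logits = [dot(q, pos) / temperature] + [dot(q, n) / temperature for n in negs]
    m = max(logits)  # subtract the max for numerical stability
    denom = sum(math.exp(z - m) for z in logits)
    return -(logits[0] - m - math.log(denom))
```

Harder negatives (closer to the query in embedding space) raise the loss and therefore provide a stronger training signal, which is why HN quality matters.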