🤖 AI Summary
This work addresses the reliance of dense retrieval training on corpus access for hard negative (HN) mining. The authors propose a fully corpus-free, end-to-end LLM-based pipeline: an LLM first generates a relevant query from a passage, then synthesizes a high-quality hard negative from that query alone, without invoking BM25, cross-encoders, or the original document collection. Retrievers such as E5-Base and GTE-Base are then trained on the resulting synthetic triples, avoiding the computational and storage overhead of corpus-based mining. On several BEIR benchmark datasets, the all-LLM pipeline matches BM25- and cross-encoder-mined baselines on nDCG@10, Precision@10, and Recall@100. To the authors' knowledge, this is the first approach enabling hard negative synthesis with zero corpus dependency. The complete synthetic dataset is publicly released, establishing a practical paradigm for retrieval model training in low-resource settings.
📄 Abstract
Training effective dense retrieval models often relies on hard negative (HN) examples mined from the document corpus via methods like BM25 or cross-encoders (CE), processes that can be computationally demanding and require full corpus access. This paper introduces a different approach: an end-to-end pipeline in which a Large Language Model (LLM) first generates a query from a passage, and then generates a hard negative example using *only* that query text. This corpus-free negative generation contrasts with standard mining techniques. We evaluated this LLM Query → LLM HN approach against traditional LLM Query → BM25 HN and LLM Query → CE HN pipelines using E5-Base and GTE-Base models on several BEIR benchmark datasets. Our results show the proposed all-LLM pipeline achieves performance identical to both the BM25 and the computationally intensive CE baselines across nDCG@10, Precision@10, and Recall@100. This demonstrates that our corpus-free negative generation method matches the effectiveness of complex, corpus-dependent mining techniques, offering a potentially simpler and more efficient pathway for training high-performance retrievers without sacrificing results. We make the dataset, including the queries and hard negatives for all three methods, publicly available at https://huggingface.co/collections/chungimungi/arxiv-hard-negatives-68027bbc601ff6cc8eb1f449.
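The two-step pipeline described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: `call_llm` stands in for any text-generation API, and the two prompt templates are assumptions, not the authors' actual prompts.

```python
# Sketch of the corpus-free pipeline: passage -> LLM query -> LLM hard negative.
# The hard-negative step sees ONLY the generated query, never the corpus.

# Hypothetical prompt templates (illustrative, not from the paper).
QUERY_PROMPT = "Write a search query that this passage answers:\n\n{passage}"
HN_PROMPT = (
    "Write a short passage that looks topically relevant to the query below "
    "but does NOT actually answer it:\n\nQuery: {query}"
)

def build_training_triple(passage: str, call_llm) -> dict:
    """Produce a (query, positive, hard negative) training triple using only
    the passage text and an LLM -- no BM25 index or cross-encoder reranking."""
    query = call_llm(QUERY_PROMPT.format(passage=passage))
    # Corpus-free step: the negative is synthesized from the query text alone.
    hard_negative = call_llm(HN_PROMPT.format(query=query))
    return {"query": query, "positive": passage, "negative": hard_negative}
```

The resulting triples can then be used to fine-tune a dense retriever (e.g., E5-Base or GTE-Base) with a standard contrastive objective; the key point is that no step requires access to the full document collection.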