🤖 AI Summary
Large-scale retrieval datasets suffer from pervasive “false negatives”: relevant documents erroneously labeled as irrelevant, which severely degrade retrieval model performance. To address this, we propose a cascaded large language model (LLM) prompting framework that leverages lightweight zero-shot cascade reasoning (GPT-4o → GPT-4o-mini) to automatically identify and relabel hard negative samples, without human annotation or model fine-tuning. Our method significantly improves dataset quality and model robustness: on the BEIR benchmark, nDCG@10 increases by 0.7–1.4 points for E5 and Qwen2.5-7B retrievers; zero-shot AIR-Bench scores improve by 1.7–1.8 points; and downstream re-rankers (e.g., Qwen2.5-3B) also benefit. The core contribution is the first application of efficient, scalable cascaded LLM inference to retrieval data cleaning, striking a practical balance between computational efficiency, generalizability, and real-world applicability.
📝 Abstract
Training robust retrieval and reranker models typically relies on large-scale retrieval datasets; for example, the BGE collection contains 1.6 million query-passage pairs sourced from various data sources. However, we find that certain datasets can negatively impact model effectiveness -- pruning 8 out of 15 datasets from the BGE collection reduces the training set size by 2.35× and increases nDCG@10 on BEIR by 1.0 point. This motivates a deeper examination of training data quality, with a particular focus on "false negatives", where relevant passages are incorrectly labeled as irrelevant. We propose a simple, cost-effective approach using cascading LLM prompts to identify and relabel hard negatives. Experimental results show that relabeling false negatives as true positives improves both E5 (base) and Qwen2.5-7B retrieval models by 0.7-1.4 nDCG@10 on BEIR and by 1.7-1.8 nDCG@10 on zero-shot AIR-Bench evaluation. Similar gains are observed for rerankers fine-tuned on the relabeled data, such as Qwen2.5-3B on BEIR. The reliability of the cascading design is further supported by human annotation results, where we find that judgments by GPT-4o show much higher agreement with humans than those of GPT-4o-mini.
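The cascading idea described above can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes a cheap judge (e.g., GPT-4o-mini) first screens every hard negative, and only pairs it flags as relevant are escalated to a stronger judge (e.g., GPT-4o) for confirmation, so the expensive model runs on a small fraction of the data. The `Judge` callables here are hypothetical stand-ins; in practice each would wrap an LLM API call with a relevance-judgment prompt.

```python
# Hedged sketch of a cascaded LLM relabeling pass (illustrative only).
from typing import Callable, List, Tuple

# A judge answers: "is this passage relevant to this query?"
Judge = Callable[[str, str], bool]

def cascade_relabel(
    pairs: List[Tuple[str, str]],
    cheap_judge: Judge,
    strong_judge: Judge,
) -> List[Tuple[str, str, str]]:
    """Return (query, passage, label) triples, relabeling a hard negative
    as 'positive' only when both judges agree it is actually relevant."""
    relabeled = []
    for query, passage in pairs:
        # Stage 1: the cheap model screens every candidate hard negative.
        if cheap_judge(query, passage):
            # Stage 2: escalate only cheap-model "relevant" verdicts
            # to the stronger (more human-aligned) model.
            label = "positive" if strong_judge(query, passage) else "negative"
        else:
            label = "negative"
        relabeled.append((query, passage, label))
    return relabeled

# Toy stand-in judges for illustration (real judges would prompt an LLM).
cheap = lambda q, p: q.split()[0] in p.lower()   # noisy keyword heuristic
strong = lambda q, p: q.lower() in p.lower()     # stricter substring check

pairs = [
    ("capital of france", "Paris is the capital of france."),
    ("capital of france", "Berlin is in Germany."),
]
print(cascade_relabel(pairs, cheap, strong))
```

The design choice mirrored here is the one motivated in the abstract: the stronger model's judgments agree more closely with humans, but invoking it on every pair would be costly, so the cheap model acts as a high-recall filter in front of it.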