Fixing Data That Hurts Performance: Cascading LLMs to Relabel Hard Negatives for Robust Information Retrieval

📅 2025-05-22
📈 Citations: 0
Influential: 0
🤖 AI Summary
Large-scale retrieval datasets suffer from pervasive “false negatives”—relevant documents erroneously labeled as irrelevant—which degrade retrieval model performance. To address this, we propose a cascaded large language model (LLM) prompting framework that uses lightweight zero-shot cascade reasoning (GPT-4o → GPT-4o-mini) to automatically identify and relabel hard negative samples, without human annotation or model fine-tuning. Our method improves both dataset quality and model robustness: on the BEIR benchmark, nDCG@10 increases by 0.7–1.4 points for E5 and Qwen2.5-7B retrievers; zero-shot AIR-Bench scores improve by 1.7–1.8 points; and downstream re-rankers (e.g., Qwen2.5-3B) also benefit. The core contribution is the first application of efficient, scalable cascaded LLM inference to retrieval data cleaning, balancing computational cost, generalizability, and practical applicability.

📝 Abstract
Training robust retrieval and reranker models typically relies on large-scale retrieval datasets; for example, the BGE collection contains 1.6 million query-passage pairs sourced from various data sources. However, we find that certain datasets can negatively impact model effectiveness -- pruning 8 out of 15 datasets from the BGE collection reduces the training set size by 2.35× and increases nDCG@10 on BEIR by 1.0 point. This motivates a deeper examination of training data quality, with a particular focus on "false negatives", where relevant passages are incorrectly labeled as irrelevant. We propose a simple, cost-effective approach using cascading LLM prompts to identify and relabel hard negatives. Experimental results show that relabeling false negatives with true positives improves both E5 (base) and Qwen2.5-7B retrieval models by 0.7-1.4 nDCG@10 on BEIR and by 1.7-1.8 nDCG@10 on zero-shot AIR-Bench evaluation. Similar gains are observed for rerankers fine-tuned on the relabeled data, such as Qwen2.5-3B on BEIR. The reliability of the cascading design is further supported by human annotation results, where we find judgment by GPT-4o shows much higher agreement with humans than GPT-4o-mini.
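The cascading idea described in the abstract can be sketched in a few lines: a cheap LLM judge screens every (query, hard-negative) pair, and only the cases it cannot decide escalate to a stronger judge. The judge interfaces, the label strings, and the "unsure"-based escalation rule below are illustrative assumptions, not the paper's exact prompts or thresholds:

```python
def cascade_judge(query, passage, cheap_judge, strong_judge):
    """Return True if the cascade deems the passage relevant to the query.

    cheap_judge / strong_judge are callables returning one of
    "relevant" | "irrelevant" | "unsure" (an assumed interface; in
    practice they would wrap calls to, e.g., GPT-4o-mini and GPT-4o).
    """
    verdict = cheap_judge(query, passage)
    if verdict == "unsure":
        # Only ambiguous cases pay for the stronger, more expensive model.
        verdict = strong_judge(query, passage)
    return verdict == "relevant"


def relabel_hard_negatives(example, cheap_judge, strong_judge):
    """Move hard negatives that the cascade judges relevant into the positives."""
    positives = list(example["positives"])
    negatives = []
    for passage in example["negatives"]:
        if cascade_judge(example["query"], passage, cheap_judge, strong_judge):
            positives.append(passage)  # false negative -> true positive
        else:
            negatives.append(passage)
    return {"query": example["query"],
            "positives": positives,
            "negatives": negatives}
```

Because the strong model is consulted only on the cheap model's uncertain cases, the per-example cost stays close to a single cheap call while agreement with human judgment tracks the stronger model on the hard cases.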
Problem

Research questions and friction points this paper is trying to address.

False negatives in large retrieval datasets silently harm model performance
Relabeling hard negatives at scale without human annotation
Improving retrieval models by correcting mislabeled training data
Innovation

Methods, ideas, or system contributions that make the work stand out.

Cascaded LLM prompting to relabel hard negatives
Consistent nDCG@10 gains for retrievers and rerankers
Cost-effective correction of false negatives