Remining Hard Negatives for Generative Pseudo Labeled Domain Adaptation

📅 2025-01-24
📈 Citations: 0
Influential: 0
🤖 AI Summary
Dense retrievers exhibit poor robustness in cross-domain zero-shot settings, largely because the hard negatives used during training are of insufficient quality, which limits generalization. To address this, we propose R-GPL, a refinement of the GPL domain adaptation framework that introduces dynamic hard-negative resampling. Specifically, R-GPL quantifies the difference in relevance between hard negatives mined before and after domain adaptation, and iteratively improves negative quality during knowledge distillation. The method combines generative pseudo-labeling, cross-encoder distillation, and dense-retriever fine-tuning, and requires no labeled data from the target domain. Evaluated on BEIR (14 datasets) and LoTTE (12 datasets), R-GPL achieves statistically significant improvements on 13/14 and 9/12 tasks, respectively, empirically validating the effectiveness and cross-domain generalizability of dynamic hard-negative resampling.

📝 Abstract
Dense retrievers have demonstrated significant potential for neural information retrieval; however, they exhibit a lack of robustness to domain shifts, thereby limiting their efficacy in zero-shot settings across diverse domains. A state-of-the-art domain adaptation technique is Generative Pseudo Labeling (GPL). GPL uses synthetic query generation and initially mined hard negatives to distill knowledge from a cross-encoder to dense retrievers in the target domain. In this paper, we analyze the documents retrieved by the domain-adapted model and discover that these are more relevant to the target queries than those of the non-domain-adapted model. We then propose refreshing the hard-negative index during the knowledge distillation phase to mine better hard negatives. Our remining R-GPL approach boosts ranking performance on 13/14 BEIR datasets and 9/12 LoTTE datasets. Our contributions are (i) analyzing hard negatives returned by domain-adapted and non-domain-adapted models and (ii) applying GPL training with and without hard-negative remining on the LoTTE and BEIR datasets.
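The core idea in the abstract, refreshing the hard-negative index during distillation so that negatives are mined with the partially adapted retriever rather than the initial one, can be illustrated with a minimal sketch. The helper names (`mine_hard_negatives`, `r_gpl_loop`, `train_step`) and the use of plain NumPy dot-product scoring in place of a real dense encoder and ANN index are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def mine_hard_negatives(query_vecs, doc_vecs, positives, k=2):
    """For each query, return the k highest-scoring documents excluding the
    labeled positive -- these serve as hard negatives for distillation."""
    scores = query_vecs @ doc_vecs.T  # dot-product relevance (stand-in for a dense index)
    negatives = []
    for qi, pos in enumerate(positives):
        ranked = np.argsort(-scores[qi])           # best-first document ids
        negatives.append([int(d) for d in ranked if d != pos][:k])
    return negatives

def r_gpl_loop(query_vecs, doc_vecs, positives, steps=3, refresh_every=1,
               train_step=None):
    """Sketch of the R-GPL training loop: every `refresh_every` steps,
    remine hard negatives with the current (partially adapted) retriever,
    then continue cross-encoder distillation via `train_step` (hypothetical)."""
    negatives = mine_hard_negatives(query_vecs, doc_vecs, positives)
    for step in range(steps):
        if step > 0 and step % refresh_every == 0:
            # Re-encoding + remining: the refreshed negatives reflect what the
            # adapted model now considers relevant, per the paper's analysis.
            negatives = mine_hard_negatives(query_vecs, doc_vecs, positives)
        if train_step is not None:
            query_vecs, doc_vecs = train_step(query_vecs, doc_vecs, negatives)
    return negatives
```

Plain GPL corresponds to never refreshing (mine once, before training); R-GPL corresponds to a small `refresh_every`, so the negative pool tracks the improving retriever.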
Problem

Research questions and friction points this paper is trying to address.

Cross-domain retrieval
Dense retrieval
Accuracy improvement
Innovation

Methods, ideas, or system contributions that make the work stand out.

Improved Pseudo Labeling
Adaptive Dense Retrieval
Quality Enhancement