Remining Hard Negatives for Generative Pseudo Labeled Domain Adaptation

2025-01-24

AI Summary
Dense retrievers generalize poorly in cross-domain zero-shot settings, in part because the hard negatives used during adaptation are of limited quality. To address this, we propose R-GPL, a refinement of the GPL domain adaptation framework that re-mines hard negatives during training. Specifically, R-GPL compares the hard negatives retrieved before and after domain adaptation, observes that the adapted model retrieves documents more relevant to the target queries, and refreshes the hard-negative index during knowledge distillation. The method combines synthetic query generation, cross-encoder distillation, and dense retriever fine-tuning, and requires no labeled data from the target domain. Evaluated on BEIR (14 datasets) and LoTTE (12 datasets), R-GPL improves ranking performance on 13 of 14 and 9 of 12 datasets, respectively. These results support the effectiveness of hard-negative re-mining for cross-domain adaptation.
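The cross-encoder distillation step mentioned above can be illustrated with a margin-based regression loss. The original GPL work trains with a MarginMSE objective; this page does not spell out the loss, so the following is a minimal sketch under that assumption, in pure Python, with hypothetical names (`dot`, `margin_mse`) and dot-product scoring assumed for the student.

```python
from typing import Sequence

Vector = Sequence[float]


def dot(u: Vector, v: Vector) -> float:
    """Dot-product similarity between two embeddings."""
    return sum(a * b for a, b in zip(u, v))


def margin_mse(
    query_vecs: Sequence[Vector],
    pos_vecs: Sequence[Vector],
    neg_vecs: Sequence[Vector],
    teacher_margins: Sequence[float],
) -> float:
    """Mean squared error between student and teacher score margins.

    Each teacher margin is assumed to be CE(q, pos) - CE(q, neg), computed
    beforehand by a cross-encoder. The student margin is the difference of
    the dense retriever's dot-product scores for the same triple.
    """
    errors = []
    for q, p, n, t in zip(query_vecs, pos_vecs, neg_vecs, teacher_margins):
        student_margin = dot(q, p) - dot(q, n)
        errors.append((student_margin - t) ** 2)
    return sum(errors) / len(errors)
```

Because the target is a margin rather than a binary label, a hard negative that is in fact relevant simply yields a small teacher margin, which is one reason the quality of the mined negatives matters for this kind of training.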

Abstract
Dense retrievers have demonstrated significant potential for neural information retrieval; however, they lack robustness to domain shifts, which limits their efficacy in zero-shot settings across diverse domains. A state-of-the-art domain adaptation technique is Generative Pseudo Labeling (GPL). GPL uses synthetic query generation and initially mined hard negatives to distill knowledge from a cross-encoder to a dense retriever in the target domain. In this paper, we analyze the documents retrieved by the domain-adapted model and discover that these are more relevant to the target queries than those of the non-domain-adapted model. We then propose refreshing the hard-negative index during the knowledge distillation phase to mine better hard negatives. Our re-mining approach, R-GPL, boosts ranking performance on 13/14 BEIR datasets and 9/12 LoTTE datasets. Our contributions are (i) analyzing hard negatives returned by domain-adapted and non-domain-adapted models and (ii) applying GPL training with and without hard-negative re-mining on the LoTTE and BEIR datasets.
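The idea of refreshing the hard-negative index during distillation can be sketched as a training loop that periodically re-encodes the corpus with the current model and re-mines negatives. This is an illustrative sketch only: the refresh interval, the number of negatives `k`, and the callables `encode` and `train_step` are hypothetical placeholders, not details taken from the paper.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Vector = Sequence[float]


def dot(u: Vector, v: Vector) -> float:
    """Dot-product similarity between two embeddings."""
    return sum(a * b for a, b in zip(u, v))


def mine_hard_negatives(
    query_vec: Vector, doc_vecs: Dict[str, Vector], positive_id: str, k: int
) -> List[str]:
    """Return the k highest-scoring documents other than the positive."""
    candidates = [doc_id for doc_id in doc_vecs if doc_id != positive_id]
    candidates.sort(key=lambda d: dot(query_vec, doc_vecs[d]), reverse=True)
    return candidates[:k]


def train_with_remining(
    encode: Callable[[str], Vector],
    train_step: Callable[[Dict[str, List[str]]], None],
    queries: Dict[str, Tuple[str, str]],  # query id -> (query text, positive doc id)
    docs: Dict[str, str],                 # doc id -> document text
    total_steps: int,
    remine_every: int,
    k: int = 2,
) -> List[int]:
    """Train while refreshing the negatives every `remine_every` steps.

    `encode` stands for the current state of the dense retriever, so each
    refresh mines negatives with the partially adapted model. Returns the
    steps at which a refresh happened.
    """
    negatives: Dict[str, List[str]] = {}
    refresh_steps: List[int] = []
    for step in range(total_steps):
        if step % remine_every == 0:
            doc_vecs = {doc_id: encode(text) for doc_id, text in docs.items()}
            negatives = {
                qid: mine_hard_negatives(encode(text), doc_vecs, pos_id, k)
                for qid, (text, pos_id) in queries.items()
            }
            refresh_steps.append(step)
        train_step(negatives)  # e.g. one distillation update on these negatives
    return refresh_steps
```

Setting `remine_every` equal to `total_steps` reduces this loop to mining once up front, which corresponds to the "initially mined hard negatives" setting described for GPL.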
Problem

Research questions and friction points this paper is trying to address.

Cross-domain retrieval
Dense retrieval
Accuracy improvement
Innovation

Methods, ideas, or system contributions that make the work stand out.

Improved Pseudo Labeling
Adaptive Dense Retrieval
Quality Enhancement