AI Summary
This work addresses the challenge of low-quality hard-negative passages and spurious negatives in contrastive learning. To mitigate this, we propose a positive-aware hard-negative mining strategy that dynamically filters out false negatives by using the relevance score between a query and its positive passage as an anchor, enabling precise selection of informative hard negatives. Our approach introduces a positive-guided mining paradigm that jointly improves training efficiency and retrieval accuracy. The underlying model is built upon a Transformer architecture and integrates contrastive learning, multi-stage teacher-student distillation, and configurable negative sampling. Evaluated on the MTEB Retrieval (BEIR) benchmark, NV-Retriever-v1 achieves a score of 60.9, ranking first upon its release in July 2024, and demonstrates substantial improvements in both retrieval performance and cross-domain generalization of text embedding models.
Abstract
Text embedding models have become popular for information retrieval applications such as semantic search and question-answering systems based on Retrieval-Augmented Generation (RAG). These models are typically Transformer models fine-tuned with contrastive learning objectives. One of the challenging aspects of fine-tuning embedding models is the selection of high-quality hard-negative passages for contrastive learning. In this paper we introduce a family of positive-aware mining methods that use the positive relevance score as an anchor for effective false-negative removal, leading to faster training and more accurate retrieval models. We provide an ablation study of hard-negative mining methods and their configurations, exploring different teacher and base models. We further demonstrate the efficacy of our proposed mining methods at scale with the NV-Retriever-v1 model, which scores 60.9 on the MTEB Retrieval (BEIR) benchmark and placed first on the MTEB Retrieval leaderboard upon its publication in July 2024.
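The positive-aware filtering idea described above can be sketched in a few lines: candidates whose teacher relevance score exceeds a fraction of the positive passage's score are treated as likely false negatives and discarded, and the top-scoring survivors are kept as hard negatives. This is a minimal illustration, not the paper's exact implementation; the `perc_of_pos` fraction and `top_k` values here are hypothetical defaults, and `candidate_scores` stands in for scores produced by whatever teacher retrieval model is used for mining.

```python
def mine_hard_negatives(pos_score, candidate_scores, perc_of_pos=0.95, top_k=4):
    """Positive-aware hard-negative mining sketch.

    pos_score: teacher relevance score of the query's positive passage.
    candidate_scores: teacher scores of candidate negative passages.
    Candidates scoring above perc_of_pos * pos_score are assumed to be
    false negatives and removed; the top_k highest-scoring remaining
    candidates are returned (as indices into candidate_scores).
    """
    # The positive score acts as the anchor for the filtering threshold.
    max_neg_score = perc_of_pos * pos_score
    kept = [(i, s) for i, s in enumerate(candidate_scores) if s <= max_neg_score]
    # Prefer the hardest (highest-scoring) surviving candidates.
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return [i for i, _ in kept[:top_k]]


# Example: a candidate scoring 0.88 against a 0.90 positive is filtered
# out as a probable false negative; the hardest remaining ones are kept.
selected = mine_hard_negatives(0.90, [0.88, 0.50, 0.70, 0.60, 0.30], top_k=2)
print(selected)  # → [2, 3]
```

Anchoring the threshold to each query's own positive score, rather than using a single global cutoff, is what makes the filter adaptive: queries with weak positives get a correspondingly lower bar for discarding suspicious negatives.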