NV-Embed: Improved Techniques for Training LLMs as Generalist Embedding Models

📅 2024-05-27
🏛️ arXiv.org
📈 Citations: 58
Influential: 8
🤖 AI Summary
To address the poor generalization and low training efficiency of decoder-only large language models (LLMs) on general-purpose text embedding tasks, this paper introduces the NV-Embed model family. It proposes (i) a latent attention pooling layer, replacing conventional [CLS] or mean pooling, to better capture salient semantic information; (ii) removal of the causal attention mask to enable bidirectional context modeling; and (iii) a two-stage contrastive instruction-tuning paradigm that integrates hard-negative mining and controllable synthetic data generation to construct a high-quality training set. The approach emphasizes simplicity, reproducibility, and strong cross-task generalization. Experiments show that NV-Embed-v1 and NV-Embed-v2 successively took first place on the MTEB leaderboard (May and August 2024), ranking top overall across 56 diverse tasks. On the AIR Benchmark, NV-Embed attains first place in long-document retrieval and second place in question answering, substantially advancing the performance frontier of LLMs as versatile encoders.
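The latent attention pooling idea in (i) can be sketched as follows: token hidden states from the decoder attend to a small trainable latent array, and the attended outputs are mean-pooled into a single embedding. This is a minimal PyTorch sketch; the class name, layer sizes, number of latent vectors, and the residual MLP placement are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class LatentAttentionPooling(nn.Module):
    """Hypothetical sketch of latent attention pooling: token states act as
    queries over a trainable latent array, then outputs are mean-pooled."""

    def __init__(self, hidden_dim: int = 64, num_latents: int = 32, num_heads: int = 4):
        super().__init__()
        # Trainable latent array that supplies keys and values.
        self.latents = nn.Parameter(torch.randn(num_latents, hidden_dim) * 0.02)
        self.attn = nn.MultiheadAttention(hidden_dim, num_heads, batch_first=True)
        self.mlp = nn.Sequential(
            nn.Linear(hidden_dim, 4 * hidden_dim),
            nn.GELU(),
            nn.Linear(4 * hidden_dim, hidden_dim),
        )

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # hidden_states: (batch, seq_len, hidden_dim) from the decoder LLM.
        batch = hidden_states.size(0)
        kv = self.latents.unsqueeze(0).expand(batch, -1, -1)
        # Tokens are queries; the latent array supplies keys and values.
        out, _ = self.attn(hidden_states, kv, kv)
        out = out + self.mlp(out)
        # Mean-pool over the sequence to get one embedding per input.
        return out.mean(dim=1)

pooler = LatentAttentionPooling()
emb = pooler(torch.randn(2, 10, 64))
print(emb.shape)  # torch.Size([2, 64])
```

Unlike last-token pooling, every token contributes to the final embedding, and unlike plain mean pooling, the attention step can re-weight salient content before averaging.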

📝 Abstract
Decoder-only large language model (LLM)-based embedding models are beginning to outperform BERT- or T5-based embedding models in general-purpose text embedding tasks, including dense vector-based retrieval. In this work, we introduce the NV-Embed model, incorporating architectural designs, training procedures, and curated datasets to significantly enhance the performance of an LLM as a versatile embedding model, while maintaining its simplicity and reproducibility. For the model architecture, we propose a latent attention layer to obtain pooled embeddings, which consistently improves retrieval and downstream task accuracy compared to mean pooling or using the last-token embedding from LLMs. To enhance representation learning, we remove the causal attention mask of LLMs during contrastive training. For the training algorithm, we introduce a two-stage contrastive instruction-tuning method. It first applies contrastive training with instructions on retrieval datasets, utilizing in-batch negatives and curated hard negative examples. At stage two, it blends various non-retrieval datasets into instruction tuning, which not only enhances non-retrieval task accuracy but also improves retrieval performance. For training data, we utilize hard-negative mining, synthetic data generation, and existing publicly available datasets to boost the performance of the embedding model. By combining these techniques, our NV-Embed-v1 and NV-Embed-v2 models obtained the No. 1 position on the Massive Text Embedding Benchmark (MTEB) (as of May 24, 2024 and August 30, 2024, respectively) across 56 embedding tasks, demonstrating the sustained effectiveness of the proposed methods over time. Additionally, NV-Embed achieved the highest scores in the Long Doc section and the second-highest scores in the QA section of the AIR Benchmark, which covers a range of out-of-domain information retrieval topics beyond those in MTEB.
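The stage-one objective described in the abstract, contrastive training with in-batch negatives plus curated hard negatives, can be sketched as an InfoNCE-style loss. This is a minimal sketch under stated assumptions: the function name, tensor shapes, and temperature value are hypothetical, and each query is paired with exactly one mined hard negative for simplicity.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(query_emb, pos_emb, hard_neg_emb, temperature=0.05):
    """InfoNCE-style loss sketch: each query scores its own positive (the
    diagonal) against all in-batch positives plus one curated hard negative."""
    q = F.normalize(query_emb, dim=-1)
    p = F.normalize(pos_emb, dim=-1)
    n = F.normalize(hard_neg_emb, dim=-1)
    # In-batch scores: row i, column i is the positive pair; off-diagonal
    # entries act as in-batch negatives.
    in_batch = q @ p.T                               # (B, B)
    # One mined hard-negative score per query, appended as an extra column.
    hard = (q * n).sum(dim=-1, keepdim=True)         # (B, 1)
    logits = torch.cat([in_batch, hard], dim=1) / temperature
    labels = torch.arange(q.size(0))                 # correct class = diagonal
    return F.cross_entropy(logits, labels)

loss = contrastive_loss(torch.randn(4, 16), torch.randn(4, 16), torch.randn(4, 16))
print(loss.item())
```

In stage two, the same loss would be applied over a blend of retrieval and non-retrieval instruction-tuning datasets; the abstract notes this blending improves both task families.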
Problem

Research questions and friction points this paper is trying to address.

Large Language Models
Efficient Training Methods
Multifunctional Encoding
Innovation

Methods, ideas, or system contributions that make the work stand out.

NV-Embed Model
Latent Attention Layer
Two-stage Contrastive Training