🤖 AI Summary
To address insufficient accuracy in multimodal text–image retrieval, this paper proposes a novel cross-modal retrieval method built upon the NVIDIA Eagle2 vision-language model (VLM). The core contribution is the integration, within a VLM, of bidirectional attention (replacing causal attention) with ColBERT-style late interaction, which the authors present as the first such combination; it enables fine-grained, interpretable cross-modal matching in a shared embedding space. Additionally, a two-stage training strategy is introduced to jointly optimize representation learning and interaction modeling. Experiments demonstrate that the proposed 3B-parameter model achieves state-of-the-art performance on the ViDoRe V1 and V2 benchmarks, attaining NDCG@5 scores of 91.0 and 63.5, respectively (the highest reported as of June 27, 2025), thereby significantly improving both retrieval accuracy and generalization capability.
📝 Abstract
Motivated by the growing demand for retrieval systems that operate across modalities, we introduce llama-nemoretriever-colembed, a unified text-image retrieval model that delivers state-of-the-art performance across multiple benchmarks. We release two model variants, 1B and 3B. The 3B model achieves state-of-the-art performance, scoring 91.0 NDCG@5 on ViDoRe V1 and 63.5 on ViDoRe V2, placing first on both leaderboards as of June 27, 2025.
Our approach leverages the NVIDIA Eagle2 Vision-Language model (VLM), modifies its architecture by replacing causal attention with bidirectional attention, and integrates a ColBERT-style late interaction mechanism to enable fine-grained multimodal retrieval in a shared embedding space. While this mechanism delivers superior retrieval accuracy, it introduces trade-offs in storage and efficiency. We provide a comprehensive analysis of these trade-offs. Additionally, we adopt a two-stage training strategy to enhance the model's retrieval capabilities.
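The ColBERT-style late interaction referenced above scores a query–document pair by comparing token-level embeddings rather than a single pooled vector: each query token is matched against its most similar document token (MaxSim), and the per-token maxima are summed. A minimal sketch of this scoring rule, using NumPy and assuming L2-normalized token embeddings as inputs (function and variable names here are illustrative, not from the released model):

```python
import numpy as np

def late_interaction_score(query_embs: np.ndarray, doc_embs: np.ndarray) -> float:
    """ColBERT-style late interaction (MaxSim).

    query_embs: (num_query_tokens, dim) token embeddings of the query.
    doc_embs:   (num_doc_tokens, dim) token embeddings of the document
                (for a VLM retriever, these may come from image patches).
    """
    # Normalize rows so dot products are cosine similarities.
    q = query_embs / np.linalg.norm(query_embs, axis=1, keepdims=True)
    d = doc_embs / np.linalg.norm(doc_embs, axis=1, keepdims=True)
    sim = q @ d.T                 # (num_query_tokens, num_doc_tokens)
    # For each query token, keep its best-matching document token,
    # then sum over query tokens.
    return float(sim.max(axis=1).sum())
```

Because every document token embedding must be stored and compared at query time, this interaction is more expensive in storage and compute than single-vector retrieval, which is the trade-off the paper analyzes.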