Nemotron ColEmbed V2: Top-Performing Late Interaction embedding models for Visual Document Retrieval

📅 2026-02-03
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
This work addresses the limitations of visual document retrieval in RAG systems, which suffer from reliance on OCR and insufficient preservation of visual information in dense retrieval. To overcome these challenges, the authors propose a Late Interaction embedding approach based on pretrained vision-language models. By integrating cluster-based sampling, hard negative mining, bidirectional attention mechanisms, and model fusion strategies, the method achieves substantial performance gains on the ViDoRe benchmark. Leveraging NVIDIA Eagle 2 and Qwen3-VL as backbone architectures, the authors release three model variants (3B, 4B, and 8B), with the 8B version achieving state-of-the-art results on the ViDoRe V3 leaderboard, attaining an average NDCG@10 of 63.42. The study also investigates low-dimensional embeddings to balance storage efficiency and retrieval accuracy.
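
For readers unfamiliar with the metric quoted above, the following is a minimal sketch of how NDCG@10 can be computed for a single query. It assumes the linear-gain form of DCG (relevance divided by the base-2 logarithm of rank plus one); the exact variant and relevance grading used by the ViDoRe leaderboard are not stated here, and the function names `dcg_at_k` and `ndcg_at_k` are illustrative. A leaderboard figure such as 63.42 is presumably this value averaged over queries and datasets and scaled by 100, which is likewise an assumption.

```python
import math


def dcg_at_k(relevances, k=10):
    """Discounted cumulative gain over the top-k results (linear gain, assumed)."""
    # rank is 0-based, so the discount for the first result is log2(2) = 1.
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(ranked_relevances, k=10, all_relevances=None):
    """NDCG@k for one query.

    ranked_relevances: relevance labels in the order the retriever returned them.
    all_relevances: labels of every judged document for the query, if known;
    defaults to the retrieved list (assumes nothing relevant was missed).
    """
    pool = ranked_relevances if all_relevances is None else all_relevances
    ideal_dcg = dcg_at_k(sorted(pool, reverse=True), k)
    if ideal_dcg == 0:
        return 0.0  # no relevant documents: define the score as zero
    return dcg_at_k(ranked_relevances, k) / ideal_dcg


# Hypothetical example: the only relevant page is returned at rank 2.
score = ndcg_at_k([0, 1, 0])  # 1 / log2(3), roughly 0.63
```

A ranking that places every relevant page first scores 1.0, and the score decays logarithmically as relevant pages slip down the list.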

📝 Abstract
Retrieval-Augmented Generation (RAG) systems have been popular for generative applications, powering language models by injecting external knowledge. Companies have been trying to leverage their large catalog of documents (e.g. PDFs, presentation slides) in such RAG pipelines, whose first step is the retrieval component. Dense retrieval has been a popular approach, where embedding models are used to generate a dense representation of the user query that is closer to relevant content embeddings. More recently, VLM-based embedding models have become popular for visual document retrieval, as they preserve visual information and simplify the indexing pipeline compared to OCR text extraction. Motivated by the growing demand for visual document retrieval, we introduce Nemotron ColEmbed V2, a family of models that achieve state-of-the-art performance on the ViDoRe benchmarks. We release three variants - with 3B, 4B, and 8B parameters - based on pre-trained VLMs: NVIDIA Eagle 2 with Llama 3.2 3B backbone, Qwen3-VL-4B-Instruct and Qwen3-VL-8B-Instruct, respectively. The 8B model ranks first on the ViDoRe V3 leaderboard as of February 03, 2026, achieving an average NDCG@10 of 63.42. We describe the main techniques used across data processing, training, and post-training - such as cluster-based sampling, hard-negative mining, bidirectional attention, late interaction, and model merging - that helped us build our top-performing models. We also discuss compute and storage engineering challenges posed by the late interaction mechanism and present experiments on how to balance accuracy and storage with lower dimension embeddings.
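
The late interaction mechanism and its storage cost mentioned above can be illustrated with a small sketch. It assumes ColBERT-style MaxSim scoring (each query token embedding is matched to its most similar document token embedding, and the maxima are summed), which is the common formulation of late interaction; the abstract does not spell out the scoring function. The helper names, the 2 bytes per value (fp16), and the page, token, and dimension counts are hypothetical illustration values, not figures from the paper.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def maxsim_score(query_tokens, doc_tokens):
    """Late interaction score (MaxSim, assumed): sum over query tokens of the
    best cosine similarity against any document token."""
    if not doc_tokens:
        return 0.0
    q = [l2_normalize(v) for v in query_tokens]
    d = [l2_normalize(v) for v in doc_tokens]
    total = 0.0
    for qv in q:
        total += max(sum(a * b for a, b in zip(qv, dv)) for dv in d)
    return total


def index_size_bytes(num_pages, tokens_per_page, dim, bytes_per_value=2):
    """Raw size of a multi-vector index: one dim-sized vector per token per page."""
    return num_pages * tokens_per_page * dim * bytes_per_value


# Toy 2-dimensional embeddings (hypothetical values).
query = [[1.0, 0.0], [0.0, 1.0]]
page_a = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # covers both query tokens
page_b = [[1.0, 0.0]]                          # covers only the first one

score_a = maxsim_score(query, page_a)
score_b = maxsim_score(query, page_b)

# Hypothetical corpus: 1,000 pages, 1,024 token vectors per page.
full_size = index_size_bytes(1000, 1024, 512)
reduced_size = index_size_bytes(1000, 1024, 128)
```

Here `page_a` outscores `page_b` because it matches both query tokens. Since the index keeps one vector per token rather than one per page, its raw size grows linearly with the embedding dimension, so reducing the dimension by a factor of four shrinks the index by the same factor.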

Problem

Research questions and friction points this paper is trying to address.

Visual Document Retrieval
Retrieval-Augmented Generation
Dense Retrieval
Embedding Models
Late Interaction

Innovation

Methods, ideas, or system contributions that make the work stand out.

late interaction
visual document retrieval
vision-language model
hard-negative mining
embedding compression

🔎 Similar Papers
G. Moreira
NVIDIA
Ronay Ak
National Institute of Standards and Technology
Machine Learning, Smart Manufacturing, Smart Grid
Mengyao Xu
NVIDIA
Oliver Holworthy
NVIDIA
Benedikt Schifferer
NVIDIA
Deep Learning, NLP, Recommender Systems
Zhiding Yu
Principal Research Scientist & Research Lead, NVIDIA Research
Computer Vision, Deep Learning
Yauhen Babakhin
NVIDIA
Radek Osmulski
NVIDIA
Jiarui Cai
AWS AI
Ryan Chesler
NVIDIA
Bo Liu
NVIDIA
Even Oldridge
NVIDIA