Omni-Embed-Nemotron: A Unified Multimodal Retrieval Model for Text, Image, Audio, and Video

📅 2025-10-03
📈 Citations: 0
Influential: 0
🤖 AI Summary
Real-world documents such as PDFs, slides, and videos contain rich, heterogeneous visual and semantic information, and conventional text-based retrievers handle them poorly because they depend on clean, structured textual input. To address this, the paper proposes Omni-Embed-Nemotron, a unified multimodal retrieval model that jointly supports text, image, audio, and video. Building on image-based document representations (as in ColPali) and a state-of-the-art multimodal backbone (Qwen2.5-Omni), the model learns a shared embedding space that supports both cross-modal retrieval (e.g., text → video) and joint-modal retrieval (e.g., text → video+audio) with a single model. Evaluated on diverse multimodal benchmarks, the approach is effective for text, image, and video retrieval and generalizes to unstructured real-world documents. Core contributions: (1) a single end-to-end retrieval architecture spanning all four modalities, and (2) demonstrated improvements in multimodal content understanding and retrieval quality.

📝 Abstract
We present Omni-Embed-Nemotron, a unified multimodal retrieval embedding model developed to handle the increasing complexity of real-world information needs. While Retrieval-Augmented Generation (RAG) has significantly advanced language models by incorporating external knowledge, existing text-based retrievers rely on clean, structured input and struggle with the visually and semantically rich content found in real-world documents such as PDFs, slides, or videos. Recent work such as ColPali has shown that preserving document layout using image-based representations can improve retrieval quality. Building on this, and inspired by the capabilities of recent multimodal models such as Qwen2.5-Omni, we extend retrieval beyond text and images to also support audio and video modalities. Omni-Embed-Nemotron enables both cross-modal (e.g., text → video) and joint-modal (e.g., text → video+audio) retrieval using a single model. We describe the architecture, training setup, and evaluation results of Omni-Embed-Nemotron, and demonstrate its effectiveness in text, image, and video retrieval.
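To make the retrieval setup concrete, here is a minimal, hedged sketch of cross-modal (text → video) retrieval in a shared embedding space. The encode function, the embedding dimension DIM, and the file names are illustrative stand-ins, not the model's actual API; random unit vectors replace real embeddings so that the ranking logic is runnable on its own.

```python
# Minimal sketch of cross-modal retrieval in a shared embedding space.
# The encoder below is a hypothetical stand-in for Omni-Embed-Nemotron
# (whose real API is not given in this summary); embeddings are random
# placeholders purely to make the scoring and ranking logic runnable.
import numpy as np

rng = np.random.default_rng(0)
DIM = 768  # hypothetical embedding dimension

def encode(item: str, modality: str) -> np.ndarray:
    """Stand-in for a unified multimodal encoder: one model maps text,
    image, audio, or video inputs into the same DIM-dimensional space."""
    vec = rng.standard_normal(DIM)      # placeholder embedding
    return vec / np.linalg.norm(vec)    # L2-normalize for cosine similarity

# Corpus of video documents (paths are illustrative).
docs = ["lecture1.mp4", "demo_clip.mp4", "keynote.mp4"]
doc_embs = np.stack([encode(d, "video") for d in docs])

# Text -> video retrieval: embed the query, score by cosine similarity.
query_emb = encode("how does retrieval-augmented generation work?", "text")
scores = doc_embs @ query_emb           # dot product of unit vectors
for idx in np.argsort(-scores):
    print(f"{docs[idx]}: {scores[idx]:.3f}")
```

In an actual deployment the placeholder encoder would be replaced by the model's forward pass; everything downstream (normalization, dot-product scoring, ranking) is standard dense retrieval.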
Problem

Research questions and friction points this paper is trying to address.

Unified multimodal retrieval for text, image, audio, and video
Overcoming the limitations of text-based retrievers on visually and semantically rich content
Enabling cross-modal and joint-modal retrieval with a single model
Innovation

Methods, ideas, or system contributions that make the work stand out.

Unified multimodal embedding model for text, image, audio, and video
Enables cross-modal and joint-modal retrieval with a single model (see the sketch after this list)
Builds on image-based document representations and extends retrieval to audio and video
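Below is an equally hedged sketch of the joint-modal case (text → video+audio). Fusing the two modality embeddings by a renormalized mean is only one simple illustration; the paper's model can encode a video together with its audio track in a single pass, which this stub does not capture. The fuse and encode helpers and the file name are hypothetical.

```python
# Hedged illustration of joint-modal (text -> video+audio) scoring: the
# two modality embeddings are fused by a renormalized mean. This is one
# plausible fusion for illustration only, not the paper's mechanism.
import numpy as np

rng = np.random.default_rng(1)
DIM = 768  # hypothetical embedding dimension

def encode(item: str, modality: str) -> np.ndarray:
    """Hypothetical unified encoder stub (random unit vectors)."""
    vec = rng.standard_normal(DIM)
    return vec / np.linalg.norm(vec)

def fuse(*embs: np.ndarray) -> np.ndarray:
    """Average per-modality embeddings, then project back to unit norm."""
    fused = np.mean(embs, axis=0)
    return fused / np.linalg.norm(fused)

doc = fuse(encode("talk.mp4", "video"), encode("talk.mp4", "audio"))
query = encode("summary of the opening keynote", "text")
print(f"joint-modal score: {float(doc @ query):.3f}")
```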
👥 Authors

Mengyao Xu
NVIDIA, Santa Clara, USA

Wenfei Zhou
NVIDIA, Los Angeles, USA

Yauhen Babakhin
NVIDIA, Prague, Czechia

Gabriel Moreira
NVIDIA, São Paulo, Brazil

Ronay Ak
NVIDIA

Radek Osmulski
NVIDIA, Brisbane, Australia

Bo Liu
NVIDIA, New York, USA

Even Oldridge
NVIDIA, Vancouver, Canada

Benedikt Schifferer
NVIDIA