🤖 AI Summary
Real-world documents such as PDFs, slides, and videos carry rich, heterogeneous visual and semantic information, while conventional text-based retrievers depend on clean, structured textual input and degrade on such content. Omni-Embed-Nemotron addresses this with a unified multimodal retrieval embedding model that jointly handles text, images, audio, and video. Building on layout-preserving, image-based document retrieval (as in ColPali) and on recent multimodal large language models such as Qwen2.5-Omni, it learns a shared embedding space that supports both cross-modal retrieval (e.g., text-to-video) and joint-modal retrieval (e.g., text to video+audio) with a single model. Evaluations on text, image, and video retrieval benchmarks demonstrate its effectiveness and its practical applicability to unstructured real-world documents.
📝 Abstract
We present Omni-Embed-Nemotron, a unified multimodal retrieval embedding model developed to handle the increasing complexity of real-world information needs. While Retrieval-Augmented Generation (RAG) has significantly advanced language models by incorporating external knowledge, existing text-based retrievers rely on clean, structured input and struggle with the visually and semantically rich content found in real-world documents such as PDFs, slides, or videos. Recent work such as ColPali has shown that preserving document layout using image-based representations can improve retrieval quality. Building on this, and inspired by the capabilities of recent multimodal models such as Qwen2.5-Omni, we extend retrieval beyond text and images to also support audio and video modalities. Omni-Embed-Nemotron enables both cross-modal (e.g., text-to-video) and joint-modal (e.g., text to video+audio) retrieval using a single model. We describe the architecture, training setup, and evaluation results of Omni-Embed-Nemotron, and demonstrate its effectiveness in text, image, and video retrieval.
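The retrieval mechanism the abstract describes, once all modalities are embedded into one shared space, reduces to nearest-neighbor search by vector similarity. The sketch below illustrates that idea only; the embedding vectors are hand-written stand-ins, since the actual encoder (the model itself) is not part of this example.

```python
import numpy as np

def cosine_sim(a, b):
    """Cosine similarity between two vectors in the shared embedding space."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical embeddings: in the real system, a text query and video
# documents would each be encoded by the unified model into vectors of
# the same dimension. These values are illustrative stand-ins.
text_query         = [1.0, 0.0, 1.0, 0.0]
video_doc_relevant = [0.9, 0.1, 0.8, 0.0]  # points in a similar direction
video_doc_other    = [0.0, 1.0, 0.0, 1.0]  # roughly orthogonal to the query

# Cross-modal retrieval: rank video documents against the text query.
scores = {
    "relevant": cosine_sim(text_query, video_doc_relevant),
    "other": cosine_sim(text_query, video_doc_other),
}
best = max(scores, key=scores.get)
```

Because every modality lands in the same space, the identical scoring loop serves text-to-image, text-to-video, or joint text-to-(video+audio) retrieval; only the encoder input changes.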