🤖 AI Summary
This work addresses the lack of a unified, efficient toolkit for training and evaluating multilingual and multimodal retrievers. The authors update the Tevatron toolkit with a unified pipeline that lets researchers explore retriever models at different scales, across multiple languages, and with various modalities, supporting efficient training, inference, and evaluation. Alongside the toolkit, they release OmniEmbed, to their knowledge the first embedding model that unifies text, image-document, video, and audio retrieval. Key contributions: (1) a unified training, inference, and evaluation pipeline that bridges academic and industrial use; (2) a single dense retriever with strong multilingual and multimodal effectiveness; and (3) a cross-modality zero-shot study demonstrating the toolkit's research potential, with OmniEmbed open-sourced as a baseline for future work.
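Dense retrievers of this kind are typically trained with an in-batch contrastive objective: each query is pulled toward its paired document while the other documents in the batch act as negatives. The following is a minimal sketch of that standard InfoNCE-style loss in NumPy; it illustrates the general technique, not the toolkit's actual implementation, and the temperature value is an assumption.

```python
import numpy as np

def info_nce_loss(q, p, temperature=0.05):
    """In-batch contrastive (InfoNCE) loss for dense retrieval.

    q: (B, D) query embeddings; p: (B, D) positive document embeddings.
    Document i is the positive for query i; every other document in the
    batch serves as a negative.
    """
    # L2-normalize so the score matrix holds cosine similarities.
    q = q / np.linalg.norm(q, axis=1, keepdims=True)
    p = p / np.linalg.norm(p, axis=1, keepdims=True)
    scores = q @ p.T / temperature                 # (B, B) similarity matrix
    # Cross-entropy with the diagonal (the true pairs) as the targets.
    scores -= scores.max(axis=1, keepdims=True)    # numerical stability
    log_probs = scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))
```

Because the loss depends only on embedding vectors, the same objective applies unchanged whether the "documents" are text passages, page images, video clips, or audio, which is what makes a unified multimodal retriever trainable in one pipeline.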
📝 Abstract
Recent advancements in large language models (LLMs) have driven interest in billion-scale retrieval models with strong generalization across retrieval tasks and languages. Additionally, progress in large vision-language models has created new opportunities for multimodal retrieval. In response, we have updated the Tevatron toolkit, introducing a unified pipeline that enables researchers to explore retriever models at different scales, across multiple languages, and with various modalities. This demo paper highlights the toolkit's key features, bridging academia and industry by supporting efficient training, inference, and evaluation of neural retrievers. We showcase a unified dense retriever achieving strong multilingual and multimodal effectiveness, and conduct a cross-modality zero-shot study to demonstrate its research potential. Alongside the toolkit, we release OmniEmbed, to the best of our knowledge the first embedding model that unifies text, image document, video, and audio retrieval, serving as a baseline for future research.
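Once every modality is embedded into one shared vector space, retrieval reduces to nearest-neighbor search over that space, regardless of whether a candidate is a text passage, a page image, a video, or an audio clip. A minimal sketch of that scoring step follows; the encoder producing the vectors is left abstract, as this is a generic cosine-similarity ranker rather than the OmniEmbed API.

```python
import numpy as np

def retrieve(query_vec, corpus_vecs, k=3):
    """Rank corpus items (of any modality) by cosine similarity to the query.

    query_vec: (D,) embedding of the query.
    corpus_vecs: (N, D) embeddings of mixed-modality candidates.
    Returns the top-k (index, score) pairs, best first.
    """
    q = query_vec / np.linalg.norm(query_vec)
    c = corpus_vecs / np.linalg.norm(corpus_vecs, axis=1, keepdims=True)
    scores = c @ q                       # cosine similarity per candidate
    top = np.argsort(-scores)[:k]        # indices of the k highest scores
    return list(zip(top.tolist(), scores[top].tolist()))
```

In practice the brute-force matrix product would be replaced by an approximate nearest-neighbor index at corpus scale, but the ranking semantics are the same.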