jina-embeddings-v4: Universal Embeddings for Multimodal Multilingual Retrieval

📅 2025-06-23
📈 Citations: 0
Influential citations: 0
📄 PDF
🤖 AI Summary
This paper addresses the challenge of representing and matching visually rich, multimodal, multilingual content (such as tables, charts, and mixed text-image documents) in retrieval tasks. To this end, we propose jina-embeddings-v4, a unified text-image embedding model built on a 3.8B-parameter Transformer backbone. It supports both single-vector and multi-vector representations in the late-interaction style, coupled with task-specific LoRA adapters for efficient adaptation across retrieval scenarios. To close the evaluation gap, we also construct Jina-VDR, a new benchmark for visually rich document retrieval. Extensive experiments demonstrate that jina-embeddings-v4 achieves state-of-the-art performance on single-modal and cross-modal retrieval tasks, including cross-modal semantic matching and code search, and that it is particularly strong in vision-intensive scenarios such as table and chart understanding.
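The hybrid single-/multi-vector design is easiest to see through its two scoring functions: cosine similarity over one pooled vector per input, and ColBERT-style late interaction (MaxSim) over token-level vectors. Below is a minimal NumPy sketch of the two scoring modes; the dimensions are illustrative only, not the model's actual embedding sizes.

```python
import numpy as np

def cosine_score(q: np.ndarray, d: np.ndarray) -> float:
    """Single-vector scoring: cosine similarity of pooled embeddings."""
    return float(q @ d / (np.linalg.norm(q) * np.linalg.norm(d)))

def late_interaction_score(Q: np.ndarray, D: np.ndarray) -> float:
    """Multi-vector (ColBERT-style) scoring: for each query token vector,
    take the max similarity over all document token vectors, then sum.
    Q: (num_query_tokens, dim), D: (num_doc_tokens, dim), rows L2-normalized."""
    sim = Q @ D.T                        # (q_tokens, d_tokens) token-level similarities
    return float(sim.max(axis=1).sum())  # MaxSim per query token, summed

# Illustrative shapes only; these are not the model's real dimensions.
rng = np.random.default_rng(0)
q_vec, d_vec = rng.normal(size=128), rng.normal(size=128)
Q = rng.normal(size=(8, 64));  Q /= np.linalg.norm(Q, axis=1, keepdims=True)
D = rng.normal(size=(50, 64)); D /= np.linalg.norm(D, axis=1, keepdims=True)

print(cosine_score(q_vec, d_vec))
print(late_interaction_score(Q, D))
```

The trade-off is the usual one: single vectors are cheap to index and compare, while late interaction preserves token-level detail at higher storage and compute cost.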

📝 Abstract
We introduce jina-embeddings-v4, a 3.8 billion parameter multimodal embedding model that unifies text and image representations through a novel architecture supporting both single-vector and multi-vector embeddings in the late interaction style. The model incorporates task-specific Low-Rank Adaptation (LoRA) adapters to optimize performance across diverse retrieval scenarios, including query-based information retrieval, cross-modal semantic similarity, and programming code search. Comprehensive evaluations demonstrate that jina-embeddings-v4 achieves state-of-the-art performance on both single-modal and cross-modal retrieval tasks, with particular strength in processing visually rich content such as tables, charts, diagrams, and mixed-media formats. To facilitate evaluation of this capability, we also introduce Jina-VDR, a novel benchmark specifically designed for visually rich image retrieval.
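As a usage illustration only: the abstract's description maps naturally onto an encode-then-score workflow. The sketch below assumes a Hugging Face checkpoint id of jinaai/jina-embeddings-v4 and encode_text / encode_image helpers taking a task argument; treat these names and parameters as assumptions inferred from the abstract, not a confirmed API.

```python
# Hypothetical usage sketch; the checkpoint id, method names, and the
# `task` argument are assumptions based on the abstract, not a confirmed API.
import torch
from transformers import AutoModel

model = AutoModel.from_pretrained(
    "jinaai/jina-embeddings-v4",  # assumed checkpoint id
    trust_remote_code=True,
    torch_dtype=torch.float16,
)

# Single-vector embeddings for query-based retrieval (assumed helpers).
query_vecs = model.encode_text(texts=["average GDP growth by region"], task="retrieval")
doc_vecs = model.encode_image(images=["chart_page.png"], task="retrieval")

# Cosine similarity between the pooled query and document vectors.
q = torch.as_tensor(query_vecs[0])
d = torch.as_tensor(doc_vecs[0])
score = torch.nn.functional.cosine_similarity(q, d, dim=-1)
print(float(score))
```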
Problem

Research questions and friction points this paper is trying to address.

Develops universal embeddings for multimodal multilingual retrieval
Optimizes performance across diverse retrieval scenarios
Enhances retrieval of visually rich content like tables and diagrams
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multimodal embedding model unifying text and image
Task-specific LoRA adapters for diverse retrieval scenarios (a minimal LoRA sketch follows this list)
State-of-the-art performance in cross-modal retrieval
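The LoRA idea behind these adapters is small enough to sketch directly: each adapted weight matrix W gains a trainable low-rank update BA while W itself stays frozen, so one backbone can serve several retrieval scenarios by swapping small per-task adapter weights. A minimal PyTorch sketch follows; the rank and layer sizes are illustrative, not the paper's values.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen base linear layer plus a trainable low-rank update:
    y = base(x) + scale * (x @ A.T) @ B.T, with A: (r, in), B: (out, r)."""
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False  # the backbone stays frozen
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))  # zero init: no-op at start
        self.scale = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scale * (x @ self.A.T) @ self.B.T

# One adapter per task; only A and B differ between tasks.
layer = LoRALinear(nn.Linear(512, 512), rank=8)
out = layer(torch.randn(2, 512))
print(out.shape)  # torch.Size([2, 512])
```

Because B is initialized to zero, the adapter starts as an exact no-op and training only ever moves the output through the low-rank path, which is what keeps per-task adaptation cheap.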
👥 Authors
Michael Günther, Jina AI GmbH, Prinzessinnenstraße 19, 10969 Berlin, Germany
Saba Sturua, Jina AI (ML Research Engineer)
Mohammad Kalim Akram, Jina AI GmbH, Prinzessinnenstraße 19, 10969 Berlin, Germany
Isabelle Mohr, Jina AI (Machine Learning Engineer)
Andrei Ungureanu, Jina AI GmbH, Prinzessinnenstraße 19, 10969 Berlin, Germany
Bo Wang, Jina AI GmbH, Prinzessinnenstraße 19, 10969 Berlin, Germany
Sedigheh Eslami, Jina AI GmbH, Prinzessinnenstraße 19, 10969 Berlin, Germany
Scott Martens, Jina AI GmbH, Prinzessinnenstraße 19, 10969 Berlin, Germany
Nan Wang, Jina AI GmbH, Prinzessinnenstraße 19, 10969 Berlin, Germany
Han Xiao, Jina AI GmbH, Prinzessinnenstraße 19, 10969 Berlin, Germany