🤖 AI Summary
This paper addresses the challenge of representing and matching visually rich multimodal, multilingual content, such as tables, charts, and mixed text-image documents, in retrieval tasks. To this end, the authors propose jina-embeddings-v4, a unified text-image embedding model built on a 3.8B-parameter Transformer architecture. It supports both single-vector and late-interaction multi-vector representations, coupled with task-specific LoRA adapters for efficient adaptation across retrieval scenarios. To close the evaluation gap, the authors also construct Jina-VDR, a new benchmark designed for visually rich, mixed text-image document retrieval. Extensive experiments show that jina-embeddings-v4 achieves state-of-the-art performance across cross-modal semantic matching, code search, and unimodal retrieval, and notably outperforms existing methods in vision-intensive scenarios such as table and chart understanding.
📝 Abstract
We introduce jina-embeddings-v4, a 3.8-billion-parameter multimodal embedding model that unifies text and image representations through a novel architecture supporting both single-vector and multi-vector embeddings in the late interaction style. The model incorporates task-specific Low-Rank Adaptation (LoRA) adapters to optimize performance across diverse retrieval scenarios, including query-based information retrieval, cross-modal semantic similarity, and programming code search. Comprehensive evaluations demonstrate that jina-embeddings-v4 achieves state-of-the-art performance on both single-modal and cross-modal retrieval tasks, with particular strength in processing visually rich content such as tables, charts, diagrams, and mixed-media formats. To facilitate evaluation of this capability, we also introduce Jina-VDR, a novel benchmark specifically designed for visually rich image retrieval.
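The "late interaction style" mentioned in the abstract typically refers to ColBERT-style MaxSim scoring over multi-vector embeddings: each query token vector is matched to its most similar document token vector, and the maxima are summed. The sketch below is illustrative only; the function name, dimensions, and toy data are assumptions, not details from the paper.

```python
import numpy as np

def maxsim_score(query_vecs: np.ndarray, doc_vecs: np.ndarray) -> float:
    """Late-interaction (MaxSim-style) relevance score.

    query_vecs: (n_q, d) L2-normalized query token embeddings.
    doc_vecs:   (n_d, d) L2-normalized document token embeddings.
    Returns the sum over query tokens of each token's best
    cosine similarity against the document tokens.
    """
    sims = query_vecs @ doc_vecs.T          # (n_q, n_d) cosine similarities
    return float(sims.max(axis=1).sum())    # sum of per-query-token maxima

def normalize(x: np.ndarray) -> np.ndarray:
    """L2-normalize each row so dot products are cosine similarities."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

# Toy example: 3 query tokens and 5 document tokens in 4 dimensions.
rng = np.random.default_rng(0)
q = normalize(rng.normal(size=(3, 4)))
d = normalize(rng.normal(size=(5, 4)))
score = maxsim_score(q, d)
```

Scoring a document against itself yields exactly `n_q` (each token's best match is itself, with cosine 1), which is a handy sanity check for an implementation like this.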