🤖 AI Summary
This paper addresses the challenge of representing and matching visually rich multimodal, multilingual content, such as tables, charts, and mixed text-image documents, in retrieval tasks. To this end, the authors propose jina-embeddings-v4, a unified text-image embedding model built on a 3.8B-parameter Transformer architecture. It supports both single-vector and late-interaction multi-vector representations, coupled with task-specific LoRA adapters for efficient adaptation across retrieval scenarios. To close the evaluation gap, the authors also construct Jina-VDR, a new benchmark designed for visually rich, mixed text-image document retrieval. Extensive experiments show that jina-embeddings-v4 achieves state-of-the-art performance across cross-modal semantic matching, code search, and unimodal retrieval, and notably outperforms existing methods in vision-intensive scenarios such as table and chart understanding.
📝 Abstract
We introduce jina-embeddings-v4, a 3.8-billion-parameter multimodal embedding model that unifies text and image representations through a novel architecture supporting both single-vector and multi-vector embeddings in the late interaction style. The model incorporates task-specific Low-Rank Adaptation (LoRA) adapters to optimize performance across diverse retrieval scenarios, including query-based information retrieval, cross-modal semantic similarity, and programming code search. Comprehensive evaluations demonstrate that jina-embeddings-v4 achieves state-of-the-art performance on both single-modal and cross-modal retrieval tasks, with particular strength in processing visually rich content such as tables, charts, diagrams, and mixed-media formats. To facilitate evaluation of this capability, we also introduce Jina-VDR, a novel benchmark specifically designed for visually rich image retrieval.
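The "late interaction style" mentioned in the abstract typically refers to ColBERT-style MaxSim scoring over multi-vector embeddings: each query token vector is matched to its most similar document token vector, and the maxima are summed. The sketch below is illustrative only; the function name, dimensions, and toy data are assumptions, not details from the paper.

```python
import numpy as np

def maxsim_score(query_vecs: np.ndarray, doc_vecs: np.ndarray) -> float:
    """Late-interaction (MaxSim-style) relevance score.

    query_vecs: (n_q, d) L2-normalized query token embeddings.
    doc_vecs:   (n_d, d) L2-normalized document token embeddings.
    Returns the sum over query tokens of each token's best
    cosine similarity against the document tokens.
    """
    sims = query_vecs @ doc_vecs.T          # (n_q, n_d) cosine similarities
    return float(sims.max(axis=1).sum())    # sum of per-query-token maxima

def normalize(x: np.ndarray) -> np.ndarray:
    """L2-normalize each row so dot products are cosine similarities."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

# Toy example: 3 query tokens and 5 document tokens in 4 dimensions.
rng = np.random.default_rng(0)
q = normalize(rng.normal(size=(3, 4)))
d = normalize(rng.normal(size=(5, 4)))
score = maxsim_score(q, d)
```

Scoring a document against itself yields exactly `n_q` (each token's best match is itself, with cosine 1), which is a handy sanity check for an implementation like this.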