🤖 AI Summary
This work addresses the challenge of efficiently constructing a unified multimodal embedding space while preserving the performance of the original text embeddings. To this end, the authors propose a frozen-encoder composition architecture that keeps the pretrained text embedding model and the added image and audio encoders fixed, aligning modalities solely through lightweight connecting components that account for only 0.35% of the joint model's total weights. Without fine-tuning any backbone model, the approach maps text, images, audio, and video into a shared embedding space. Text inputs yield the same embeddings as the original Jina Embeddings v5 Text models, and multimodal retrieval performance is reported as nearly equal to that of larger state-of-the-art models.
📝 Abstract
In this work, we introduce frozen-encoder model composition, a novel approach to multimodal embedding models. We build on the VLM-style architecture, in which non-text encoders are adapted to produce input for a language model, which in turn generates embeddings for all input types. The result is the jina-embeddings-v5-omni suite, a pair of models that encode text, image, audio, and video input into a single semantic embedding space. Our method extends the two Jina Embeddings v5 Text models to support additional media by adding encoders for images and audio. The backbone text embedding models and the added non-text media encoders remain frozen. We train only the connecting components, which represent 0.35% of the total weights of the joint model. Training is therefore much more efficient than full-parameter retraining. Additionally, the language model remains effectively unaltered, producing exactly the same embeddings for text inputs as the Jina Embeddings v5 Text models. Our evaluations show that this approach is competitive with the state of the art, with performance nearly equal to that of larger multimodal embedding models.
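The composition described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the tiny matrices stand in for real neural encoders, and the names `language_model`, `image_encoder`, `Connector`, `embed_text`, and `embed_image` are hypothetical. The sketch also assumes that text inputs bypass the connector and that the backbone mean-pools its outputs. It shows only the core property: updating the connector changes image embeddings while text embeddings stay exactly the same.

```python
# Toy sketch of frozen-encoder composition. All names, sizes, and weights are
# hypothetical; real encoders are neural networks, not tiny matrices.

def matvec(matrix, vector):
    """Multiply a matrix (sequence of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def mean_pool(vectors):
    """Average a list of equal-length vectors element-wise."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

# Frozen weights: stored as tuples and never updated.
LM_WEIGHTS = ((1.0, 0.5), (-0.5, 1.0))                       # stand-in text backbone
IMAGE_ENCODER_WEIGHTS = ((2.0, 0.0, 1.0), (0.0, 1.0, -1.0))  # stand-in image encoder (3 -> 2)

def language_model(input_vectors):
    """Frozen backbone: transform each input vector, then mean-pool (assumed pooling)."""
    return mean_pool([matvec(LM_WEIGHTS, v) for v in input_vectors])

def image_encoder(patches):
    """Frozen image encoder: map each patch to a feature vector."""
    return [matvec(IMAGE_ENCODER_WEIGHTS, p) for p in patches]

class Connector:
    """The only trainable part: projects encoder features into the LM input space."""

    def __init__(self, weights):
        self.weights = [list(row) for row in weights]

    def __call__(self, features):
        return [matvec(self.weights, f) for f in features]

def embed_text(token_vectors):
    # Assumption: text goes straight to the frozen backbone, skipping the connector.
    return language_model(token_vectors)

def embed_image(patches, connector):
    # Image path: frozen encoder -> trainable connector -> frozen backbone.
    return language_model(connector(image_encoder(patches)))

connector = Connector([[1.0, 0.0], [0.0, 1.0]])
tokens = [[1.0, 2.0], [3.0, 4.0]]
patches = [[1.0, 0.0, 2.0], [0.0, 1.0, 1.0]]

text_before = embed_text(tokens)
image_before = embed_image(patches, connector)

# Simulate one training update that touches only the connector.
connector.weights[0][1] += 0.25

text_after = embed_text(tokens)
image_after = embed_image(patches, connector)

print(text_before == text_after)    # text embedding is unchanged
print(image_before != image_after)  # image embedding moved
```

Because the backbone and encoder weights never change, only the connector's parameters would need gradients and optimizer state during training, which is the source of the efficiency gain the abstract describes.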