🤖 AI Summary
This work addresses critical limitations in existing omni-modal embedding approaches, which rely on implicit alignment within vision-language models (VLMs), leading to inconsistent similarity scales, ineffective negative samples, and misaligned cross-modal embedding geometries. To overcome these issues, we propose e5-omni, a lightweight yet effective explicit alignment framework that adapts off-the-shelf VLMs through three key components: modality-aware temperature calibration, curriculum-based debiased negative sampling, and batch whitening with covariance regularization. This is the first approach to systematically address these three challenges together in omni-modal representation learning. Extensive experiments demonstrate that e5-omni outperforms strong baselines on the MMEB-V2 and AudioCaps benchmarks while exhibiting strong transferability. The code is publicly released to facilitate further research.
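The first component, modality-aware temperature calibration, could look like the following minimal sketch: one learnable log-temperature per (query-modality, document-modality) pair, used to rescale similarity logits before a standard InfoNCE loss. All names and the exact parameterization here are illustrative assumptions, not the paper's formulation.

```python
import torch
import torch.nn.functional as F

class ModalityAwareTemperature(torch.nn.Module):
    """Hypothetical sketch of modality-aware temperature calibration:
    similarity logits from different modality pairs (text-image,
    text-audio, ...) get their own temperature, so scores land on a
    comparable scale before the contrastive loss."""

    def __init__(self, n_modalities: int = 4, init_tau: float = 0.05):
        super().__init__()
        init = torch.log(torch.tensor(init_tau))
        # One learnable log-temperature per (query, document) modality pair.
        self.log_tau = torch.nn.Parameter(init * torch.ones(n_modalities, n_modalities))

    def forward(self, q, d, q_mod, d_mod):
        # q, d: L2-normalized embeddings [B, D]; q_mod, d_mod: modality ids [B].
        sim = q @ d.t()                            # [B, B] cosine similarities
        tau = self.log_tau[q_mod][:, d_mod].exp()  # [B, B] pairwise temperatures
        logits = sim / tau                         # calibrated logits
        labels = torch.arange(q.size(0), device=q.device)
        return F.cross_entropy(logits, labels)     # standard InfoNCE objective
```

With a single shared temperature this reduces to ordinary contrastive training; the per-pair table only adds `n_modalities**2` scalar parameters.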
📝 Abstract
Modern information systems often involve different types of items, e.g., a text query, an image, a video clip, or an audio segment. This motivates omni-modal embedding models that map heterogeneous modalities into a shared space for direct comparison. However, most recent omni-modal embeddings still rely heavily on implicit alignment inherited from pretrained vision-language model (VLM) backbones. In practice, this causes three common issues: (i) similarity logits have modality-dependent sharpness, so scores are not on a consistent scale; (ii) in-batch negatives become less effective as training progresses, because mixed-modality batches create an imbalanced hardness distribution in which many negatives quickly become trivial and contribute little gradient; and (iii) embeddings across modalities show mismatched first- and second-order statistics, which makes rankings less stable. To tackle these problems, we propose e5-omni, a lightweight explicit alignment recipe that adapts off-the-shelf VLMs into robust omni-modal embedding models. e5-omni combines three simple components: (1) modality-aware temperature calibration to align similarity scales, (2) a controllable negative curriculum with debiasing to focus on confusing negatives while reducing the impact of false negatives, and (3) batch whitening with covariance regularization to better match cross-modal geometry in the shared embedding space. Experiments on MMEB-V2 and AudioCaps show consistent gains over strong bi-modal and omni-modal baselines, and the same recipe also transfers well to other VLM backbones. We release our model checkpoint at https://huggingface.co/Haon-Chen/e5-omni-7B.
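Component (3), batch whitening with covariance regularization, can be illustrated with a short sketch. This assumes a ZCA-style whitening of each batch plus a decorrelation penalty on off-diagonal covariance entries; the paper's exact procedure and hyperparameters may differ.

```python
import torch

def batch_whiten(x: torch.Tensor, eps: float = 1e-5) -> torch.Tensor:
    """ZCA-style batch whitening sketch: center the batch (match
    first-order statistics) and map its sample covariance to the
    identity (match second-order statistics)."""
    x = x - x.mean(dim=0, keepdim=True)           # center the batch
    cov = (x.t() @ x) / (x.size(0) - 1)           # [D, D] sample covariance
    eigval, eigvec = torch.linalg.eigh(cov + eps * torch.eye(x.size(1)))
    w = eigvec @ torch.diag(eigval.rsqrt()) @ eigvec.t()  # ZCA whitening matrix
    return x @ w

def covariance_penalty(x: torch.Tensor) -> torch.Tensor:
    """Covariance regularizer sketch: penalize off-diagonal entries of
    the batch covariance so feature dimensions stay decorrelated."""
    x = x - x.mean(dim=0, keepdim=True)
    cov = (x.t() @ x) / (x.size(0) - 1)
    off_diag = cov - torch.diag(torch.diag(cov))
    return (off_diag ** 2).sum() / x.size(1)
```

Applying the same whitening to each modality's embeddings pushes their first- and second-order statistics toward a common reference, which is the mismatch the abstract identifies in issue (iii).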