🤖 AI Summary
Existing evaluation protocols for image and multimodal (image–text) embeddings suffer from task fragmentation, narrow language coverage, and superficial characterization of model capabilities. Method: This paper introduces MIEB, the first large-scale, multilingual, cross-task unified benchmark for multimodal embedding evaluation, comprising 130 tasks across 38 languages grouped into 8 high-level capability categories. Evaluation on this suite uncovers hidden strengths and bottlenecks, such as accurate visual representation of text in advanced vision models alongside still-limited interleaved encoding and weak robustness to confounders when matching images and texts. Contribution/Results: Across 50 state-of-the-art models, no single model dominates all task categories; notably, vision encoder scores on MIEB correlate strongly with downstream performance when those encoders are used in multimodal large language models. All code, data, and the leaderboard are publicly released to advance standardization of embedding model evaluation.
📝 Abstract
Image representations are often evaluated through disjointed, task-specific protocols, leading to a fragmented understanding of model capabilities. For instance, it is unclear whether an image embedding model adept at clustering images is equally good at retrieving relevant images given a piece of text. We introduce the Massive Image Embedding Benchmark (MIEB) to evaluate the performance of image and image–text embedding models across the broadest spectrum of tasks to date. MIEB spans 38 languages across 130 individual tasks, which we group into 8 high-level categories. We benchmark 50 models on MIEB, finding that no single method dominates across all task categories. We reveal hidden capabilities in advanced vision models, such as their accurate visual representation of texts, as well as their still-limited capabilities in interleaved encoding and in matching images and texts in the presence of confounders. We also show that the performance of vision encoders on MIEB correlates highly with their performance when used in multimodal large language models. Our code, dataset, and leaderboard are publicly available at https://github.com/embeddings-benchmark/mteb.
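The text-to-image retrieval setting mentioned above (retrieving relevant images given a piece of text) can be sketched with a toy evaluation loop: embed queries and candidates, rank candidates by cosine similarity, and score recall@1 against gold pairs. This is an illustrative sketch only, not MIEB's actual protocol; the hand-made three-dimensional embeddings and file names below are hypothetical stand-ins for real model outputs.

```python
import math

def cosine(u, v):
    # Cosine similarity between two equal-length vectors.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy embeddings standing in for an embedding model's outputs (illustrative).
text_embeddings = {
    "a photo of a dog": [0.9, 0.10, 0.00],
    "a photo of a cat": [0.1, 0.90, 0.00],
}
image_embeddings = {
    "dog.jpg": [0.8, 0.20, 0.10],
    "cat.jpg": [0.2, 0.85, 0.05],
}
# Gold mapping from each text query to its relevant image.
ground_truth = {"a photo of a dog": "dog.jpg", "a photo of a cat": "cat.jpg"}

def recall_at_1(texts, images, gold):
    # For each text query, rank all images by similarity; count top-1 hits.
    hits = 0
    for query, q_emb in texts.items():
        best = max(images, key=lambda name: cosine(q_emb, images[name]))
        hits += best == gold[query]
    return hits / len(texts)

print(recall_at_1(text_embeddings, image_embeddings, ground_truth))  # → 1.0
```

A benchmark like MIEB runs many such task-specific scorings (retrieval, clustering, matching with confounders, etc.) over the same frozen embeddings, which is what allows capabilities to be compared across task categories.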