🤖 AI Summary
Existing evaluation protocols for image and multimodal (image–text) embeddings suffer from task fragmentation, narrow language coverage, and superficial characterization of model capabilities. Method: This paper introduces MIEB, the first large-scale, multilingual, cross-task unified benchmark for multimodal embedding evaluation, comprising 130 tasks across 38 languages grouped into 8 high-level capability categories. Evaluation on this suite uncovers hidden strengths and bottlenecks, such as accurate visual representation of text in advanced vision models alongside still-limited interleaved encoding and weak robustness to confounders when matching images and texts. Contribution/Results: Across 50 state-of-the-art models, no single model dominates all task categories; notably, vision encoder scores on MIEB correlate strongly with downstream performance when those encoders are used in multimodal large language models. All code, data, and the leaderboard are publicly released to advance standardization of embedding model evaluation.
📝 Abstract
Image representations are often evaluated through disjointed, task-specific protocols, leading to a fragmented understanding of model capabilities. For instance, it is unclear whether an image embedding model adept at clustering images is equally good at retrieving relevant images given a piece of text. We introduce the Massive Image Embedding Benchmark (MIEB) to evaluate the performance of image and image–text embedding models across the broadest spectrum of tasks to date. MIEB spans 38 languages across 130 individual tasks, which we group into 8 high-level categories. We benchmark 50 models on MIEB, finding that no single method dominates across all task categories. We reveal hidden capabilities in advanced vision models, such as their accurate visual representation of texts, as well as their still-limited capabilities in interleaved encoding and in matching images and texts in the presence of confounders. We also show that the performance of vision encoders on MIEB correlates highly with their performance when used in multimodal large language models. Our code, dataset, and leaderboard are publicly available at https://github.com/embeddings-benchmark/mteb.
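The text-to-image retrieval setting mentioned above (retrieving relevant images given a piece of text) can be sketched with a toy evaluation loop: embed queries and candidates, rank candidates by cosine similarity, and score recall@1 against gold pairs. This is an illustrative sketch only, not MIEB's actual protocol; the hand-made three-dimensional embeddings and file names below are hypothetical stand-ins for real model outputs.

```python
import math

def cosine(u, v):
    # Cosine similarity between two equal-length vectors.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy embeddings standing in for an embedding model's outputs (illustrative).
text_embeddings = {
    "a photo of a dog": [0.9, 0.10, 0.00],
    "a photo of a cat": [0.1, 0.90, 0.00],
}
image_embeddings = {
    "dog.jpg": [0.8, 0.20, 0.10],
    "cat.jpg": [0.2, 0.85, 0.05],
}
# Gold mapping from each text query to its relevant image.
ground_truth = {"a photo of a dog": "dog.jpg", "a photo of a cat": "cat.jpg"}

def recall_at_1(texts, images, gold):
    # For each text query, rank all images by similarity; count top-1 hits.
    hits = 0
    for query, q_emb in texts.items():
        best = max(images, key=lambda name: cosine(q_emb, images[name]))
        hits += best == gold[query]
    return hits / len(texts)

print(recall_at_1(text_embeddings, image_embeddings, ground_truth))  # → 1.0
```

A benchmark like MIEB runs many such task-specific scorings (retrieval, clustering, matching with confounders, etc.) over the same frozen embeddings, which is what allows capabilities to be compared across task categories.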