MMTEB: Massive Multilingual Text Embedding Benchmark

📅 2025-02-19
🤖 AI Summary
Existing text embedding evaluations suffer from limited task scale and narrow coverage of languages and domains. This work introduces MMTEB, a large-scale, community-driven expansion of MTEB encompassing 250+ languages and 500+ quality-controlled tasks, including emerging challenges such as instruction following, long-document retrieval, and code retrieval. To keep evaluation tractable, the authors propose two optimizations: (1) a downsampling strategy based on inter-task correlation that preserves a diverse task selection while maintaining relative model rankings; and (2) hard-negative sampling for retrieval tasks, yielding smaller but still effective splits. A newly introduced zero-shot English sub-benchmark maintains a ranking order similar to the full-scale version at a fraction of the computational cost. Empirically, while billion-parameter LLMs lead on certain language subsets and task categories, the best-performing publicly available model is multilingual-e5-large-instruct with only 560 million parameters.
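The correlation-based downsampling mentioned above can be illustrated with a small sketch. This is an assumed, simplified version of the idea, not the authors' exact algorithm, and `downsample_tasks` is an illustrative name: given a tasks × models score matrix, greedily drop one endpoint of the most correlated task pair until the target size is reached, so the retained tasks stay mutually less redundant.

```python
import numpy as np

def downsample_tasks(scores: np.ndarray, keep: int) -> list[int]:
    """Greedily drop the most redundant tasks.

    scores: (n_tasks, n_models) matrix of per-task model scores.
    Returns the original indices of the `keep` retained tasks.
    """
    remaining = list(range(scores.shape[0]))
    while len(remaining) > keep:
        sub = scores[remaining]
        corr = np.corrcoef(sub)            # task-task Pearson correlations
        np.fill_diagonal(corr, -np.inf)    # ignore self-correlation
        # find the most correlated (i.e. most redundant) task pair
        i, j = np.unravel_index(np.argmax(corr), corr.shape)
        # drop the endpoint that is more correlated with the rest of the set
        drop = i if np.sort(corr[i])[-2] > np.sort(corr[j])[-2] else j
        remaining.pop(drop)
    return remaining

# toy data: 8 models on 10 tasks that all share one latent "quality" signal
rng = np.random.default_rng(0)
base = rng.normal(size=(1, 8))
scores = base + 0.1 * rng.normal(size=(10, 8))
kept = downsample_tasks(scores, keep=4)
print(kept)
```

The greedy pair-pruning here is one of several reasonable heuristics; the point is only that redundant tasks carry little extra ranking information and can be removed cheaply.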

📝 Abstract
Text embeddings are typically evaluated on a limited set of tasks, which are constrained by language, domain, and task diversity. To address these limitations and provide a more comprehensive evaluation, we introduce the Massive Multilingual Text Embedding Benchmark (MMTEB) - a large-scale, community-driven expansion of MTEB, covering over 500 quality-controlled evaluation tasks across 250+ languages. MMTEB includes a diverse set of challenging, novel tasks such as instruction following, long-document retrieval, and code retrieval, representing the largest multilingual collection of evaluation tasks for embedding models to date. Using this collection, we develop several highly multilingual benchmarks, which we use to evaluate a representative set of models. We find that while large language models (LLMs) with billions of parameters can achieve state-of-the-art performance on certain language subsets and task categories, the best-performing publicly available model is multilingual-e5-large-instruct with only 560 million parameters. To facilitate accessibility and reduce computational cost, we introduce a novel downsampling method based on inter-task correlation, ensuring a diverse selection while preserving relative model rankings. Furthermore, we optimize tasks such as retrieval by sampling hard negatives, creating smaller but effective splits. These optimizations allow us to introduce benchmarks that drastically reduce computational demands. For instance, our newly introduced zero-shot English benchmark maintains a ranking order similar to the full-scale version but at a fraction of the computational cost.
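The hard-negative optimization from the abstract can be sketched as follows. This is a simplified assumption of the approach, not the benchmark's exact pipeline, and `build_reduced_corpus` is an illustrative name: rather than scoring every query against the full corpus, keep each query's relevant documents plus only its top-k most similar non-relevant documents, producing a much smaller but still challenging retrieval split.

```python
import numpy as np

def build_reduced_corpus(q_emb, d_emb, relevant, k=2):
    """Keep relevant docs plus k hard negatives per query.

    q_emb: (n_q, dim) query embeddings; d_emb: (n_d, dim) doc embeddings.
    relevant: dict mapping query index -> set of relevant doc indices.
    Returns the sorted union of kept document indices.
    """
    # cosine similarity via normalized dot products
    q = q_emb / np.linalg.norm(q_emb, axis=1, keepdims=True)
    d = d_emb / np.linalg.norm(d_emb, axis=1, keepdims=True)
    sims = q @ d.T
    keep: set[int] = set()
    for qi in range(len(q)):
        rel = relevant.get(qi, set())
        keep |= rel
        order = np.argsort(-sims[qi])  # most similar documents first
        hard = [di for di in order if di not in rel][:k]
        keep |= {int(di) for di in hard}
    return sorted(keep)

rng = np.random.default_rng(0)
q_emb = rng.normal(size=(3, 16))
d_emb = rng.normal(size=(50, 16))
relevant = {0: {0}, 1: {1}, 2: {2}}
small = build_reduced_corpus(q_emb, d_emb, relevant, k=2)
print(len(small), "of", len(d_emb), "documents kept")
```

Because the discarded documents are the easy negatives, evaluation cost shrinks roughly in proportion to the corpus reduction while the split remains discriminative.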
Problem

Research questions and friction points this paper is trying to address.

Existing embedding evaluations cover only a limited set of tasks, constrained in language, domain, and task diversity
Novel task types such as instruction following, long-document retrieval, and code retrieval lack benchmark coverage
Comprehensive multilingual evaluation at this scale is computationally expensive
Innovation

Methods, ideas, or system contributions that make the work stand out.

Massive Multilingual Text Embedding Benchmark (MMTEB): 500+ quality-controlled tasks across 250+ languages
Novel downsampling method based on inter-task correlation that preserves relative model rankings
Retrieval tasks optimized via hard-negative sampling into smaller but effective splits
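The claim that the reduced benchmarks "preserve relative model rankings" can be checked with a generic rank-correlation test (this is a standard sanity check, not the paper's exact protocol): compare per-model mean scores on the full task set against those on a subset.

```python
import numpy as np

def spearman(a, b):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    ra = np.argsort(np.argsort(a))
    rb = np.argsort(np.argsort(b))
    return float(np.corrcoef(ra, rb)[0, 1])

# toy data: 6 models with a latent quality, evaluated on 20 noisy tasks
rng = np.random.default_rng(1)
quality = rng.normal(size=6)
scores = quality + 0.05 * rng.normal(size=(20, 6))

full = scores.mean(axis=0)         # mean score per model, all tasks
subset = scores[:5].mean(axis=0)   # mean score per model, 5-task subset
rho = spearman(full, subset)
print(round(rho, 3))
```

A rank correlation near 1 indicates the subset orders models almost identically to the full benchmark, which is the property the downsampling method is designed to retain.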
👥 Authors

Kenneth C. Enevoldsen (Aarhus University)
Isaac Chung (Zendesk): Machine Learning, Computer Vision, Natural Language Processing
Imene Kerboua (Esker, INSA Lyon, LIRIS)
Márton Kardos (Junior Developer, Center for Humanities Computing, Aarhus University): NLP, topic modeling, Bayesian machine learning, model interpretability
Ashwin Mathur (Individual Contributor)
David Stap (NXAI): Machine Translation, Machine Learning, Natural Language Processing
Jay Gala (MBZUAI)
Wissam Siblini (PhD, Machine Learning, Worldline R&D): Machine Learning, Extreme Multi-Label Classification, Dimensionality Reduction
Dominik Krzemiński (Individual Contributor)
Genta Indra Winata (Capital One AI Foundations): Multilinguality, Language Modeling, Multimodal, Low-resource NLP, Code-Switching
Saba Sturua (ML Research Engineer): Natural Language Processing, Machine Learning
Saiteja Utpala (Microsoft Research)
Mathieu Ciancone (Wikit)
Marion Schaeffer (Wikit)
Gabriel Sequeira (Individual Contributor)
Diganta Misra (McGill University)
Shreeya Dhakal (Individual Contributor)
Jonathan Rystrøm (University of Oxford)
Roman Solomatin (ITMO University)
Ömer Çağatan (Koç University)
Akash Kundu (Heritage Institute of Technology, Apart Research)
Martin Bernstorff (Aarhus University)
Shitao Xiao (BUPT)
Akshita Sukhlecha (Individual Contributor)
Bhavish Pahwa (Microsoft Research)
Rafał Poświata (National Information Processing Institute): natural language processing, machine learning, deep learning, sentiment analysis
Kranthi Kiran GV (New York University)
Shawon Ashraf (Ellamind)
Daniel Auras (Founding AI Engineer, ellamind)
Björn Plüster (Ellamind)
Jan Philipp Harries (Ellamind)
Loïc Magne (Individual Contributor)
Isabelle Mohr (Machine Learning Engineer, Jina AI): NLP, computer vision, computational linguistics
Mariya Hendriksen (University of Oxford): Artificial Intelligence, Vision and Language, AI for Neuroscience
Dawei Zhu (Peking University)
Hippolyte Gisserot-Boukhlef (PhD Candidate, CentraleSupélec, Université Paris-Saclay): Artificial Intelligence, LLMs
Tom Aarsen (Hugging Face)
Jan Kostkan (Aarhus University)
Konrad Wojtasik (Wrocław University of Science and Technology): Natural Language Processing
Taemin Lee (Korea University)
Marek Šuppa (Comenius University in Bratislava): Natural Language Processing, Computer Vision, Machine Learning
Crystina Zhang (University of Waterloo): Information Retrieval, Natural Language Processing
Roberta Rocca (Aarhus University)
Mohammed Hamdy (Cohere For AI)
Andrianos Michail (University of Zurich)
John Yang (Stanford University)
Manuel Faysse (CentraleSupélec - Université Paris Saclay): Natural Language Processing, Machine Learning, Privacy
Aleksei Vatolin (FRC CSC RAS)
Nandan Thakur (PhD Student, University of Waterloo): information retrieval, natural language processing, deep learning, machine learning
Manan Dey (Salesforce)
Dipam Vasani (Individual Contributor)
Pranjal Chitale (IIT Madras)
Simone Tedeschi (Applied Scientist @ Amazon): Natural Language Processing, Large Language Models, Responsible AI
Nguyen Tai (University of Pennsylvania): Natural Language Processing
Artem Snegirev (SaluteDevices)
Michael Günther (Jina AI)
Mengzhou Xia (Princeton University): Natural Language Processing, Machine Learning
Weijia Shi (University of Washington): Natural Language Processing, Machine Learning
Xing Han Lu (McGill University)
Jordan Clive (Imperial College London)
Gayatri Krishnakumar (R. V. College of Engineering)
Anna Maksimova (SaluteDevices)
Silvan Wehrli (Robert Koch Institute)
Maria Tikhonova (SaluteDevices, HSE University)
Henil Panchal (Nirma University)
Aleksandr Abramov (SaluteDevices)
Malte Ostendorff (University of Göttingen / German Research Center for Artificial Intelligence): Large language models, Recommender systems, Information retrieval
Zheng Liu (BAAI)
Simon Clematide (University of Zurich)
Lester James Miranda (Allen Institute for AI)
Alena Fenogenova (SaluteDevices)
Guangyu Song (Tano Labs)
Ruqiya Bin Safi (The London Institute of Banking and Finance)
Wen-Ding Li (Cornell University): Machine Learning
Alessia Borghini (Sapienza University of Rome)
Federico Cassano (Northeastern University): Artificial Intelligence, Programming Languages, Supply Chain Security
Hongjin Su (Hong Kong University)
Jimmy Lin (University of Waterloo): information retrieval, natural language processing, data management, big data
Howard Yen (Princeton University): Natural language processing
Lasse Hansen (echoscout.ai)
Sara Hooker (Head of Cohere For AI): Machine learning efficiency, robustness, interpretability, trustworthy ML
Chenghao Xiao (Durham University): Natural Language Processing, Information Retrieval, Representation Learning
Vaibhav Adlakha (McGill University, ServiceNow Research)
Orion Weller (Johns Hopkins University): Natural Language Processing, Information Retrieval, Machine Learning
Siva Reddy (McGill University, Mila Quebec AI Institute): Natural Language Processing, Computational Linguistics, Deep Learning, Semantics
Niklas Muennighoff (Stanford University): large language models, artificial intelligence, machine learning