MTEB-NL and E5-NL: Embedding Benchmark and Models for Dutch

📅 2025-09-15
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Dutch is severely underrepresented in existing multilingual embedding resources. To address this gap, we introduce MTEB-NL, the first comprehensive Dutch-specific text embedding benchmark, covering diverse tasks including retrieval, classification, and clustering. We further propose the E5-NL family of embedding models, trained via contrastive learning on a combination of real-world retrieval data and high-quality synthetic samples generated by large language models, yielding compact yet efficient representations. Experimental results show that E5-NL consistently outperforms baseline models across the MTEB-NL tasks, with particularly notable gains in generalization on non-retrieval tasks. All benchmark data, evaluation infrastructure, and pretrained models are publicly released on the Hugging Face Hub and integrated into the MTEB framework. This work fills a critical gap in Dutch embedding technology and offers a reusable methodology for embedding research in lower-resource languages.
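The summary states that the E5-NL models are trained via contrastive learning on paired data. A common formulation of this objective is InfoNCE with in-batch negatives. The sketch below is illustrative only: the function name, the temperature value, and the use of in-batch negatives are assumptions, not details taken from the paper.

```python
import numpy as np

def info_nce_loss(queries: np.ndarray, passages: np.ndarray, temperature: float = 0.05) -> float:
    """InfoNCE contrastive loss with in-batch negatives.

    queries[i] and passages[i] form a positive pair; every other passage in
    the batch acts as a negative for query i. Embeddings are L2-normalized,
    so the dot product is cosine similarity.
    """
    q = queries / np.linalg.norm(queries, axis=1, keepdims=True)
    p = passages / np.linalg.norm(passages, axis=1, keepdims=True)
    sims = q @ p.T / temperature  # (batch, batch) scaled similarity matrix
    # Numerically stable log-softmax over each row; the diagonal entries
    # correspond to the positive pairs.
    logits = sims - sims.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))
```

Minimizing this loss pulls each query toward its paired passage and pushes it away from the other passages in the batch, which is what makes in-batch training with large batches effective for retrieval-style embeddings.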

📝 Abstract
Recently, embedding resources, including models, benchmarks, and datasets, have been widely released to support a variety of languages. However, the Dutch language remains underrepresented, typically comprising only a small fraction of the published multilingual resources. To address this gap and encourage the further development of Dutch embeddings, we introduce new resources for their evaluation and generation. First, we introduce the Massive Text Embedding Benchmark for Dutch (MTEB-NL), which includes both existing Dutch datasets and newly created ones, covering a wide range of tasks. Second, we provide a training dataset compiled from available Dutch retrieval datasets, complemented with synthetic data generated by large language models to expand task coverage beyond retrieval. Finally, we release a series of E5-NL models: compact yet efficient embedding models that demonstrate strong performance across multiple tasks. We make our resources publicly available through the Hugging Face Hub and the MTEB package.
Problem

Research questions and friction points this paper is trying to address.

Addressing underrepresentation of Dutch in embedding resources
Introducing benchmark and models for Dutch text embeddings
Providing evaluation and generation resources for Dutch embeddings
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dutch embedding benchmark MTEB-NL
Training dataset with synthetic LLM data
Series of compact, efficient E5-NL embedding models
Nikolay Banar, University of Antwerp
Ehsan Lotfi, University of Antwerp
Jens Van Nooten, University of Antwerp
Cristina Arhiliuc, University of Antwerp
Marija Kliocaite, University of Antwerp
Walter Daelemans, Professor of Computational Linguistics, University of Antwerp
Research areas: Computational Linguistics, Natural Language Processing, Computational Psycholinguistics