AI Summary
Existing text embedding models have three key limitations: limited multilingual coverage (fewer than 100 languages), single-function design (each model supports only dense, sparse, or multi-vector retrieval in isolation), and poor adaptability to input granularity (difficulty handling long documents of up to 8K tokens). This work introduces M3-Embedding, presented as the first unified multilingual, multifunctional, and multi-granular text embedding model: it supports more than 100 working languages, integrates dense, multi-vector, and sparse retrieval within a single architecture, and accepts inputs ranging from short sentences to 8,192-token documents. Methodologically, the authors propose a self-knowledge distillation framework that combines the relevance scores of the different retrieval functions into a teacher signal; they also introduce granularity-aware batch optimization and cross-lingual contrastive learning to improve embedding discriminability and generalization. Experiments show state-of-the-art performance on multilingual and cross-lingual retrieval benchmarks. The model and code are publicly released to support real-world IR applications.
Abstract
In this paper, we present a new embedding model, called M3-Embedding, which is distinguished for its versatility in Multi-Linguality, Multi-Functionality, and Multi-Granularity. It can support more than 100 working languages, leading to new state-of-the-art performance on multi-lingual and cross-lingual retrieval tasks. It can simultaneously perform the three common retrieval functionalities of embedding models: dense retrieval, multi-vector retrieval, and sparse retrieval, which provides a unified model foundation for real-world IR applications. It is able to process inputs of different granularities, spanning from short sentences to long documents of up to 8192 tokens. The effective training of M3-Embedding involves the following technical contributions. We propose a novel self-knowledge distillation approach, where the relevance scores from different retrieval functionalities can be integrated as the teacher signal to enhance the training quality. We also optimize the batching strategy, enabling a large batch size and high training throughput to ensure the discriminativeness of embeddings. To the best of our knowledge, M3-Embedding is the first embedding model that achieves such strong versatility. The model and code will be publicly available at https://github.com/FlagOpen/FlagEmbedding.
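To make the self-knowledge distillation idea concrete, the following is a minimal sketch in plain Python, not the authors' implementation. The abstract does not give the scoring formulas or the loss, so everything here is an assumption made for illustration: inner product for dense scoring, summed term-weight products for sparse scoring, averaged max-similarity (late interaction) for multi-vector scoring, an equally weighted sum as the teacher, and a softmax cross-entropy as the distillation loss. All function names are hypothetical.

```python
import math


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def dense_score(q_vec, d_vec):
    """Dense retrieval (assumed form): one vector per text, inner product."""
    return dot(q_vec, d_vec)


def sparse_score(q_weights, d_weights):
    """Sparse retrieval (assumed form): sum of weight products over shared terms."""
    shared = q_weights.keys() & d_weights.keys()
    return sum(q_weights[t] * d_weights[t] for t in shared)


def multi_vector_score(q_vecs, d_vecs):
    """Multi-vector retrieval (assumed form): for each query token vector,
    take its best match among the document token vectors, then average."""
    best = [max(dot(q, d) for d in d_vecs) for q in q_vecs]
    return sum(best) / len(best)


def log_softmax(xs):
    """Numerically stable log-softmax over a list of scores."""
    m = max(xs)
    log_z = math.log(sum(math.exp(x - m) for x in xs))
    return [x - m - log_z for x in xs]


def distillation_loss(teacher_scores, student_scores):
    """Cross-entropy between softmax(teacher) and softmax(student)."""
    p = [math.exp(lp) for lp in log_softmax(teacher_scores)]
    log_q = log_softmax(student_scores)
    return -sum(pi * lq for pi, lq in zip(p, log_q))


def self_distillation_loss(dense, sparse, multi, weights=(1.0, 1.0, 1.0)):
    """Each argument lists one query's scores against the same candidates.
    The teacher is a weighted sum of the three score lists (weights are an
    assumption); each single-function score list acts as a student."""
    w1, w2, w3 = weights
    teacher = [w1 * a + w2 * b + w3 * c for a, b, c in zip(dense, sparse, multi)]
    losses = [distillation_loss(teacher, s) for s in (dense, sparse, multi)]
    return sum(losses) / len(losses)


if __name__ == "__main__":
    # Toy scores for one query against three candidate passages.
    dense = [2.0, 0.5, 0.1]
    sparse = [1.0, 1.2, 0.0]
    multi = [1.5, 0.4, 0.2]
    print(self_distillation_loss(dense, sparse, multi))
```

In a real training setup the teacher scores would typically be treated as fixed targets (no gradient), and the loss would be combined with the model's main retrieval objective; both points are assumptions here, since the abstract does not specify them.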