BGE M3-Embedding: Multi-Lingual, Multi-Functionality, Multi-Granularity Text Embeddings Through Self-Knowledge Distillation

📅 2024-02-05

🏛️ Annual Meeting of the Association for Computational Linguistics

📈 Citations: 895

✨ Influential: 92

🤖 AI Summary

Existing text embedding models are often limited in three ways: narrow multilingual coverage, support for only one retrieval functionality (dense, sparse, or multi-vector), and difficulty with long inputs. This work introduces M3-Embedding, a single model that is multilingual, multi-functional, and multi-granular: it supports more than 100 working languages, performs dense, multi-vector, and sparse retrieval within one model, and accepts inputs ranging from short sentences to documents of up to 8,192 tokens. For training, the authors propose a self-knowledge distillation approach in which the relevance scores from the different retrieval functionalities are integrated into a teacher signal. They also optimize the batching strategy to allow a large batch size and high training throughput, which helps keep the embeddings discriminative. The model sets a new state of the art on multilingual and cross-lingual retrieval tasks, and the model and code are publicly released.

📝 Abstract

In this paper, we present a new embedding model, called M3-Embedding, which is distinguished by its versatility in Multi-Linguality, Multi-Functionality, and Multi-Granularity. It can support more than 100 working languages, leading to new state-of-the-art performance on multi-lingual and cross-lingual retrieval tasks. It can simultaneously perform the three common retrieval functionalities of an embedding model: dense retrieval, multi-vector retrieval, and sparse retrieval, which provides a unified model foundation for real-world IR applications. It is able to process inputs of different granularities, spanning from short sentences to long documents of up to 8192 tokens. The effective training of M3-Embedding involves the following technical contributions. We propose a novel self-knowledge distillation approach, where the relevance scores from different retrieval functionalities can be integrated as the teacher signal to enhance the training quality. We also optimize the batching strategy, enabling a large batch size and high training throughput to ensure the discriminativeness of embeddings. To the best of our knowledge, M3-Embedding is the first embedding model that realizes such strong versatility. The model and code will be publicly available at https://github.com/FlagOpen/FlagEmbedding.

Problem

Research questions and friction points this paper is trying to address.

Develops a versatile embedding model for multilingual semantic retrieval

Enables simultaneous dense, multi-vector, and sparse retrieval functionalities

Processes text inputs from short sentences to long documents
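
To make the three retrieval functionalities concrete, here is a minimal sketch of how each one might score a query against a document. The function names are hypothetical, and the scoring rules (inner product for dense, weight products over shared terms for sparse, average of per-token maxima for multi-vector) are commonly used forms assumed for illustration; this page does not specify the exact formulas M3-Embedding uses.

```python
def dense_score(q_vec, d_vec):
    """Dense retrieval: inner product of one query vector and one document vector."""
    return sum(a * b for a, b in zip(q_vec, d_vec))


def sparse_score(q_weights, d_weights):
    """Sparse retrieval: sum of weight products over terms present in both texts.

    Each argument maps a term to a non-negative importance weight (assumed form).
    """
    return sum(w * d_weights[t] for t, w in q_weights.items() if t in d_weights)


def multi_vector_score(q_vecs, d_vecs):
    """Multi-vector retrieval: for each query token vector, take its best match
    among the document token vectors, then average over query tokens."""
    best_matches = [max(dense_score(q, d) for d in d_vecs) for q in q_vecs]
    return sum(best_matches) / len(q_vecs)


if __name__ == "__main__":
    print(dense_score([1.0, 0.0], [0.5, 0.5]))
    print(sparse_score({"a": 1.0, "b": 2.0}, {"b": 0.5, "c": 3.0}))
    print(multi_vector_score([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.5, 0.5]]))
```

In a unified model all three scores would come from the same encoder's outputs, which is what lets a single model serve all three retrieval modes.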

Innovation

Methods, ideas, or system contributions that make the work stand out.

Self-knowledge distillation integrates multiple retrieval functionalities

Optimized batching enables large batch size for discriminative embeddings

Unified model supports multilingual, multifunctional, and multigranular text retrieval
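
The self-knowledge distillation idea can be sketched as follows: for one query, the dense, sparse, and multi-vector scores over a set of candidate passages are combined into an integrated score, and the softmax of that integrated score serves as a soft teacher target for each individual scoring function. This is only an illustrative sketch. The function names, the weighted-sum integration, the default weights, and the averaging of the three losses are assumptions, not the authors' implementation.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def soft_cross_entropy(teacher_probs, student_scores):
    """Cross-entropy between a teacher distribution and the student's softmax."""
    student_probs = softmax(student_scores)
    return -sum(t * math.log(p) for t, p in zip(teacher_probs, student_probs))


def self_distillation_loss(dense, sparse, multi, weights=(1.0, 1.0, 1.0)):
    """Distill an integrated teacher signal back into each retrieval function.

    dense, sparse, multi: per-candidate scores for one query (same length).
    weights: hypothetical mixing weights for the integrated score.
    """
    w_dense, w_sparse, w_multi = weights
    integrated = [
        w_dense * a + w_sparse * b + w_multi * c
        for a, b, c in zip(dense, sparse, multi)
    ]
    teacher = softmax(integrated)  # treated as a fixed target (no gradient)
    losses = [soft_cross_entropy(teacher, s) for s in (dense, sparse, multi)]
    return sum(losses) / len(losses)


if __name__ == "__main__":
    # Candidate 0 is the positive passage; candidate 1 is a negative.
    print(self_distillation_loss([2.0, 0.5], [1.0, 0.2], [1.5, 0.1]))
```

Because the teacher is built from the model's own outputs, no separate teacher model is needed. In a real training loop the teacher distribution would be detached from the gradient computation.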
