🤖 AI Summary
This work addresses three barriers to the equitable development of high-quality text embeddings: high computational cost, limited language coverage, and model opacity. It introduces a 3-Dimensional Matryoshka Learning framework (3D-ML) that combines Matryoshka Representation Learning (MRL) for storage compression, Matryoshka Layer Learning (MLL) for flexible inference-time depth, and a new component, Matryoshka Embedding Learning (MEL), for parameter efficiency. Building on this framework, the authors train ML-Embed, a suite of multilingual embedding models ranging from 140M to 8B parameters and covering over 100 languages, and publicly release the models, training data, and code. Evaluation across 430 tasks shows the models setting new records on 9 of 17 evaluated MTEB benchmarks, with particularly strong results on low-resource languages.
📝 Abstract
The development of high-quality text embeddings is increasingly drifting toward an exclusionary future, defined by three critical barriers: prohibitive computational costs, a narrow linguistic focus that neglects most of the world's languages, and a lack of transparency from closed-source or open-weight models that stifles research. To dismantle these barriers, we introduce ML-Embed, a suite of inclusive and efficient models built upon a new framework: 3-Dimensional Matryoshka Learning (3D-ML). Our framework addresses the computational challenge with comprehensive efficiency across the entire model lifecycle. Beyond the storage benefits of Matryoshka Representation Learning (MRL) and flexible inference-time depth provided by Matryoshka Layer Learning (MLL), we introduce Matryoshka Embedding Learning (MEL) for enhanced parameter efficiency. To address the linguistic challenge, we curate a massively multilingual dataset and train a suite of models ranging from 140M to 8B parameters. In a direct commitment to transparency, we release all models, data, and code. Extensive evaluation on 430 tasks demonstrates that our models set new records on 9 of 17 evaluated MTEB benchmarks, with particularly strong results in low-resource languages, providing a reproducible blueprint for building globally equitable and computationally efficient AI systems.
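The storage benefit attributed to MRL above comes from a general property of Matryoshka-style embeddings: a prefix of the full vector can be used as a smaller embedding after renormalization. The sketch below illustrates only that generic idea and is not the ML-Embed implementation; the function names and the toy 8-dimensional vectors are hypothetical, and it does not cover MLL or MEL.

```python
import math

def truncate_embedding(vec, dim):
    """Keep the first `dim` coordinates and L2-renormalize (MRL-style truncation)."""
    if not 0 < dim <= len(vec):
        raise ValueError("dim must be between 1 and len(vec)")
    head = vec[:dim]
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        return list(head)  # an all-zero prefix cannot be normalized
    return [x / norm for x in head]

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

# Toy vectors standing in for model outputs (made-up values, not real embeddings).
query = [0.9, 0.1, 0.3, -0.2, 0.05, 0.0, -0.1, 0.02]
doc = [0.8, 0.2, 0.25, -0.1, -0.3, 0.4, 0.1, -0.05]

# Store and compare only the first 4 of 8 dimensions: half the storage per vector.
short_query = truncate_embedding(query, 4)
short_doc = truncate_embedding(doc, 4)

full_sim = cosine(query, doc)
short_sim = cosine(short_query, short_doc)
print(f"full: {full_sim:.3f}  truncated: {short_sim:.3f}")
```

With a model actually trained under an MRL objective, the truncated similarity is expected to track the full one closely; with arbitrary vectors like the toy values here, no such guarantee holds.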