🤖 AI Summary
This work addresses two gaps in existing LLM-based multilingual embedding models: limited support for low- and medium-resource languages, and poor computational efficiency. We propose the F2LLM-v2 family of universal multilingual embedding models, covering over 200 languages, which integrates two-stage LLM embedding training, Matryoshka representation learning, model pruning, and knowledge distillation to substantially improve inclusivity, performance, and efficiency. The F2LLM-v2-14B variant achieves state-of-the-art results across 11 MTEB benchmark tasks, while its lightweight counterparts set new performance records under resource-constrained settings. All models and associated data are publicly released to foster further research and applications in multilingual representation learning.
📝 Abstract
We present F2LLM-v2, a new family of general-purpose, multilingual embedding models in eight sizes ranging from 80M to 14B parameters. Trained on a newly curated corpus of 60 million publicly available, high-quality data samples, F2LLM-v2 supports more than 200 languages, with a particular emphasis on previously underserved mid- and low-resource languages. By combining a two-stage LLM-based embedding training pipeline with Matryoshka representation learning, model pruning, and knowledge distillation, the resulting models are far more efficient than previous LLM-based embedding models while retaining competitive performance. Extensive evaluations confirm that F2LLM-v2-14B ranks first on 11 MTEB benchmarks, while the smaller models in the family also set a new state of the art for resource-constrained applications. To facilitate open-source embedding model research, we release all models, data, code, and intermediate checkpoints.
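Matryoshka representation learning trains the model so that any leading prefix of an embedding is itself a usable, lower-dimensional embedding. The sketch below (not the F2LLM-v2 API; the function name, dimensions, and random vectors are illustrative assumptions) shows the inference-time payoff: truncate a full embedding to its first `dim` components, L2-renormalize, and compare vectors in the cheaper truncated space.

```python
import numpy as np

def truncate_matryoshka(emb: np.ndarray, dim: int) -> np.ndarray:
    """Keep the first `dim` components and L2-renormalize.

    With Matryoshka-trained embeddings, this prefix remains a valid
    embedding, trading a little accuracy for storage and compute.
    """
    sub = emb[..., :dim]
    norm = np.linalg.norm(sub, axis=-1, keepdims=True)
    return sub / np.clip(norm, 1e-12, None)

# Toy stand-ins for model outputs: two unit-normalized 1024-d vectors.
rng = np.random.default_rng(0)
full = rng.normal(size=(2, 1024))
full /= np.linalg.norm(full, axis=-1, keepdims=True)

# Compare them in a truncated 256-d space instead of the full 1024-d space.
small = truncate_matryoshka(full, 256)
cos = float(small[0] @ small[1])  # cosine similarity of the truncated pair
print(small.shape, cos)
```

In practice this lets one model serve several accuracy/cost operating points: index documents at full dimension, then search with truncated queries where latency or storage is tight.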