🤖 AI Summary
Existing LLM-based embedding models predominantly rely on LoRA fine-tuning, which suffers from misalignment between pretraining objectives and embedding-specific data distributions and training paradigms. This work proposes training a dedicated 1.4B-parameter embedding model from scratch to overcome the inherent limitations of general-purpose LLMs. The method introduces: (1) a learnable soft masking mechanism enabling a smooth transition from causal to bidirectional attention; (2) a dynamic hard negative mining strategy to enhance discriminative capability; and (3) a high-quality cross-lingual retrieval dataset to improve multilingual semantic alignment. The model is pretrained on news and parallel corpora, then fine-tuned using sentence-level contrastive loss with hard-negative sampling. It achieves state-of-the-art performance on both the MTEB and Chinese MTEB benchmarks (v2025-05-19), demonstrating that a relatively lightweight model can deliver efficient, highly generalizable multilingual text embeddings.
📝 Abstract
Large language models (LLMs) have recently demonstrated excellent performance in text embedding tasks. Previous work usually uses LoRA to fine-tune existing LLMs, an approach limited by the data and training gap between LLMs and embedding models. In this work, we introduce Conan-embedding-v2, a new 1.4B-parameter LLM trained from scratch and fine-tuned as a text embedder. First, we add news data and multilingual pairs to the LLM pretraining corpus to bridge the data gap. Building on this, we propose a cross-lingual retrieval dataset that enables the LLM to better align embeddings across different languages. Second, whereas LLMs use a causal mask with token-level loss, embedding models use a bidirectional mask with sentence-level loss. This training gap makes full fine-tuning less effective than LoRA. We introduce a soft-masking mechanism to gradually transition between these two types of masks, enabling the model to learn more comprehensive representations. On top of this, we propose a dynamic hard negative mining method that exposes the model to progressively more difficult negative examples throughout training. Intuitive and effective, Conan-embedding-v2 achieves SOTA performance with only approximately 1.4B parameters on both the Massive Text Embedding Benchmark (MTEB) and Chinese MTEB (as of May 19, 2025).
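The soft-masking idea can be sketched as a blend between the two mask types. This is an illustrative assumption, not the paper's exact formulation: here the blending coefficient `alpha` is a plain scheduled float, whereas the paper describes it as learnable.

```python
import numpy as np

def soft_attention_mask(seq_len: int, alpha: float) -> np.ndarray:
    """Blend a causal (lower-triangular) mask into a bidirectional one.

    alpha = 0.0 -> purely causal attention (LLM pretraining regime),
    alpha = 1.0 -> fully bidirectional attention (embedding regime).
    Hypothetical sketch: the paper learns this transition; here alpha
    is simply interpolated by the caller (e.g. ramped over training).
    """
    causal = np.tril(np.ones((seq_len, seq_len)))   # 1 below/on diagonal
    bidirectional = np.ones((seq_len, seq_len))     # attend everywhere
    return (1.0 - alpha) * causal + alpha * bidirectional
```

At `alpha = 0` future positions are fully masked out; as `alpha` grows toward 1, those positions receive increasing attention weight, giving the gradual causal-to-bidirectional transition the abstract describes.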
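Dynamic hard negative mining can likewise be sketched in a few lines. The function name and the top-k selection are illustrative assumptions; the core idea from the abstract is that candidate negatives are re-scored with the current model during training, so the negatives kept in the contrastive loss stay difficult as the model improves.

```python
import numpy as np

def mine_hard_negatives(query_emb: np.ndarray,
                        candidate_embs: np.ndarray,
                        k: int = 2) -> np.ndarray:
    """Return indices of the k hardest negatives for one query.

    Hypothetical sketch: scores each candidate negative by similarity
    to the query under the *current* embeddings (dot product; cosine
    similarity if inputs are L2-normalized) and keeps the k most
    similar, i.e. most confusable, candidates. Re-running this as
    training progresses implements the "dynamic" part.
    """
    sims = candidate_embs @ query_emb       # shape: (num_candidates,)
    return np.argsort(-sims)[:k]            # highest similarity first
```

In a training loop, this selection would be refreshed every few steps so that negatives the model has already learned to separate are replaced by currently harder ones.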