🤖 AI Summary
Existing LLM-based embedding models predominantly rely on LoRA fine-tuning, which suffers from misalignment between pretraining objectives and embedding-specific data distributions and training paradigms. This work proposes training a dedicated 1.4B-parameter embedding model from scratch to overcome the inherent limitations of general-purpose LLMs. The method introduces: (1) a learnable soft masking mechanism enabling a smooth transition from causal to bidirectional attention; (2) a dynamic hard negative mining strategy to enhance discriminative capability; and (3) a high-quality cross-lingual retrieval dataset to improve multilingual semantic alignment. The model is pretrained on news and parallel corpora, then fine-tuned using sentence-level contrastive loss with hard-negative sampling. It achieves state-of-the-art performance on both the MTEB and Chinese MTEB benchmarks (v2025-05-19), demonstrating that a relatively lightweight model can deliver efficient, highly generalizable multilingual text embeddings.
📝 Abstract
Large language models (LLMs) have recently demonstrated excellent performance in text embedding tasks. Previous work usually uses LoRA to fine-tune existing LLMs, an approach limited by the data and training gap between LLMs and embedding models. In this work, we introduce Conan-embedding-v2, a new 1.4B-parameter LLM trained from scratch and fine-tuned as a text embedder. First, we add news data and multilingual pairs to the LLM pretraining corpus to bridge the data gap. Building on this, we propose a cross-lingual retrieval dataset that enables the LLM to better align embeddings across different languages. Second, whereas LLMs use a causal mask with token-level loss, embedding models use a bidirectional mask with sentence-level loss. This training gap makes full fine-tuning less effective than LoRA. We introduce a soft-masking mechanism to gradually transition between these two types of masks, enabling the model to learn more comprehensive representations. On top of this, we propose a dynamic hard negative mining method that exposes the model to progressively more difficult negative examples throughout training. Intuitive and effective, Conan-embedding-v2 achieves SOTA performance with only approximately 1.4B parameters on both the Massive Text Embedding Benchmark (MTEB) and Chinese MTEB (as of May 19, 2025).
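The soft-masking idea can be sketched as a blend between the two mask types. This is an illustrative assumption, not the paper's exact formulation: here the blending coefficient `alpha` is a plain scheduled float, whereas the paper describes it as learnable.

```python
import numpy as np

def soft_attention_mask(seq_len: int, alpha: float) -> np.ndarray:
    """Blend a causal (lower-triangular) mask into a bidirectional one.

    alpha = 0.0 -> purely causal attention (LLM pretraining regime),
    alpha = 1.0 -> fully bidirectional attention (embedding regime).
    Hypothetical sketch: the paper learns this transition; here alpha
    is simply interpolated by the caller (e.g. ramped over training).
    """
    causal = np.tril(np.ones((seq_len, seq_len)))   # 1 below/on diagonal
    bidirectional = np.ones((seq_len, seq_len))     # attend everywhere
    return (1.0 - alpha) * causal + alpha * bidirectional
```

At `alpha = 0` future positions are fully masked out; as `alpha` grows toward 1, those positions receive increasing attention weight, giving the gradual causal-to-bidirectional transition the abstract describes.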
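Dynamic hard negative mining can likewise be sketched in a few lines. The function name and the top-k selection are illustrative assumptions; the core idea from the abstract is that candidate negatives are re-scored with the current model during training, so the negatives kept in the contrastive loss stay difficult as the model improves.

```python
import numpy as np

def mine_hard_negatives(query_emb: np.ndarray,
                        candidate_embs: np.ndarray,
                        k: int = 2) -> np.ndarray:
    """Return indices of the k hardest negatives for one query.

    Hypothetical sketch: scores each candidate negative by similarity
    to the query under the *current* embeddings (dot product; cosine
    similarity if inputs are L2-normalized) and keeps the k most
    similar, i.e. most confusable, candidates. Re-running this as
    training progresses implements the "dynamic" part.
    """
    sims = candidate_embs @ query_emb       # shape: (num_candidates,)
    return np.argsort(-sims)[:k]            # highest similarity first
```

In a training loop, this selection would be refreshed every few steps so that negatives the model has already learned to separate are replaced by currently harder ones.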