KaLM-Embedding-V2: Superior Training Techniques and Data Inspire A Versatile Embedding Model

📅 2025-06-25
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the performance bottleneck of compact text embedding models: limited semantic representation capability and weak cross-task generalization under small parameter budgets. To this end, the authors propose a high-performance, parameter-efficient, general-purpose text embedding model based on a fully bidirectional Transformer with mean pooling, trained via a three-stage paradigm: (1) large-scale weakly supervised pre-training, (2) fine-grained, high-quality supervised fine-tuning, and (3) model-soup parameter averaging. Key innovations include focal-style hard-negative reweighting, dynamic online hard-negative mixing, and hierarchical data categorization. Evaluated on the Chinese and English MTEB benchmarks, the model substantially outperforms same-sized state-of-the-art methods and matches or exceeds embedding models 3 to 26 times larger in parameter count, setting a new standard for lightweight text embeddings.
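To make the architecture choice concrete, here is a minimal sketch of bidirectional encoding with masked mean pooling, assuming a generic Hugging Face backbone; the checkpoint name is a placeholder, not the authors' released model:

```python
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

# Placeholder backbone; KaLM-Embedding-V2's actual base model may differ.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoder = AutoModel.from_pretrained("bert-base-uncased")  # fully bidirectional attention

def embed(texts):
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = encoder(**batch).last_hidden_state       # (B, T, H)
    mask = batch["attention_mask"].unsqueeze(-1).float()  # (B, T, 1)
    # Mean-pool over real tokens only, ignoring padding positions.
    emb = (hidden * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1e-9)
    return F.normalize(emb, dim=-1)                       # fixed-length embeddings
```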

📝 Abstract
In this paper, we propose KaLM-Embedding-V2, a versatile and compact embedding model, which achieves impressive performance in general-purpose text embedding tasks by leveraging superior training techniques and data. Our key innovations include: (1) To better align the architecture with representation learning, we remove the causal attention mask and adopt a fully bidirectional transformer with simple yet effective mean-pooling to produce fixed-length embeddings; (2) We employ a multi-stage training pipeline: (i) pre-training on large-scale weakly supervised open-source corpora; (ii) fine-tuning on high-quality retrieval and non-retrieval datasets; and (iii) model-soup parameter averaging for robust generalization. In addition, we introduce a focal-style reweighting mechanism that concentrates learning on difficult samples and an online hard-negative mixing strategy to continuously enrich hard negatives without expensive offline mining; (3) We collect over 20 categories of data for pre-training and 100 categories of data for fine-tuning, to boost both the performance and generalization of the embedding model. Extensive evaluations on the Chinese and English Massive Text Embedding Benchmark (MTEB) show that our model significantly outperforms others of comparable size, and competes with 3x, 14x, 18x, and 26x larger embedding models, setting a new standard for a versatile and compact embedding model with under 1B parameters.
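To illustrate the loss-side ideas, below is a hedged sketch of a focal-style reweighted InfoNCE objective, plus one plausible reading of online hard-negative mixing. The weighting function, gamma, temperature, and the top-k mixing heuristic are assumptions for illustration, not the paper's exact formulation:

```python
import torch
import torch.nn.functional as F

def focal_infonce(q, p, negs, temperature=0.05, gamma=2.0):
    """q: (B, H) queries; p: (B, H) positives; negs: (B, K, H) hard negatives.
    All inputs are assumed L2-normalized."""
    pos_sim = (q * p).sum(dim=-1, keepdim=True)        # (B, 1) query-positive similarity
    neg_sim = torch.einsum("bh,bkh->bk", q, negs)      # (B, K) query-negative similarities
    logits = torch.cat([pos_sim, neg_sim], dim=1) / temperature
    log_probs = F.log_softmax(logits, dim=1)
    p_pos = log_probs[:, 0].exp()                      # probability mass on the positive
    # Focal-style weight: well-separated (easy) pairs are down-weighted,
    # so the gradient signal concentrates on difficult samples.
    weights = (1.0 - p_pos).pow(gamma).detach()
    return -(weights * log_probs[:, 0]).mean()

def mix_online_hard_negatives(q, pool, k=4):
    """Pick each query's k most similar pool embeddings as extra hard negatives,
    avoiding offline mining. Filtering out a query's own positive is omitted
    here for brevity."""
    sims = q @ pool.T                                  # (B, N)
    idx = sims.topk(k, dim=1).indices                  # (B, k)
    return pool[idx]                                   # (B, k, H)
```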
Problem

Research questions and friction points this paper is trying to address.

Develop a versatile compact embedding model for text tasks
Enhance performance via advanced training techniques and data
Achieve robust generalization with multi-stage training pipeline
Innovation

Methods, ideas, or system contributions that make the work stand out.

Bidirectional transformer with mean-pooling
Multi-stage training pipeline with model-soup parameter averaging (sketched after this list)
Focal-style reweighting mechanism
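Since the pipeline's final stage is model-soup parameter averaging, here is a minimal uniform-soup sketch over fine-tuned checkpoints; the paths are hypothetical, and the paper may use a weighted or greedy variant rather than a uniform average:

```python
import torch

def uniform_soup(checkpoint_paths):
    """Element-wise average of several fine-tuned checkpoints (uniform soup)."""
    soup = None
    for path in checkpoint_paths:
        state = torch.load(path, map_location="cpu")
        if soup is None:
            soup = {k: v.clone().float() for k, v in state.items()}
        else:
            for k in soup:
                soup[k] += state[k].float()
    return {k: v / len(checkpoint_paths) for k, v in soup.items()}

# Hypothetical usage:
# model.load_state_dict(uniform_soup(["ft_run0.pt", "ft_run1.pt", "ft_run2.pt"]))
```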
👥 Authors
Xinping Zhao
Harbin Institute of Technology (Shenzhen), Shenzhen, China
Xinshuo Hu
Harbin Institute of Technology, Shenzhen
Large Language Model, Text Generation, Truthfulness
Zifei Shan
Applied Research at Tencent
machine learning, natural language processing, language models, knowledge graphs
Shouzheng Huang
Harbin Institute of Technology (Shenzhen), Shenzhen, China
Yao Zhou
Harbin Institute of Technology (Shenzhen), Shenzhen, China
Zetian Sun
Harbin Institute of Technology (Shenzhen), Shenzhen, China
Zhenyu Liu
Harbin Institute of Technology (Shenzhen), Shenzhen, China
Dongfang Li
Harbin Institute of Technology (Shenzhen), Shenzhen, China
Xinyuan Wei
Harbin Institute of Technology (Shenzhen), Shenzhen, China
Qian Chen
Harbin Institute of Technology (Shenzhen), Shenzhen, China
Youcheng Pan
Pengcheng Laboratory, Shenzhen, China
Yang Xiang
Pengcheng Laboratory, Shenzhen, China
Meishan Zhang
Associate Professor, Harbin Institute of Technology at Shenzhen
Natural Language Processing, Computational Linguistics, Syntax Parsing, Sentiment Analysis, Machine
Haofen Wang
Tongji University
Knowledge Graph, Natural Language Processing, Retrieval Augmented Generation
Jun Yu
Harbin Institute of Technology (Shenzhen), Shenzhen, China
Baotian Hu
Harbin Institute of Technology (Shenzhen)
LLM, MLLM, NLP
Min Zhang
Harbin Institute of Technology (Shenzhen), Shenzhen, China