KaLM-Embedding-V2: Superior Training Techniques and Data Inspire A Versatile Embedding Model

📅 2025-06-25
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the performance bottleneck of compact text embedding models: limited semantic representation capability and weak cross-task generalization under small parameter budgets. To this end, the authors propose a high-performance, parameter-efficient, general-purpose text embedding model based on a fully bidirectional Transformer with mean pooling, trained via a three-stage paradigm: (1) large-scale weakly supervised pre-training, (2) fine-grained, high-quality supervised fine-tuning, and (3) model-soup parameter averaging. Key innovations include focal-style hard-negative reweighting, dynamic online hard-negative mixing, and hierarchical data categorization. Evaluated on the Chinese and English MTEB benchmarks, the model substantially outperforms same-sized state-of-the-art methods and matches or exceeds embedding models 3 to 26 times larger in parameter count, setting a new standard for lightweight text embeddings.
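To make the architecture choice concrete, here is a minimal sketch of bidirectional encoding with masked mean pooling, assuming a generic Hugging Face backbone; the checkpoint name is a placeholder, not the authors' released model:

```python
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

# Placeholder backbone; KaLM-Embedding-V2's actual base model may differ.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoder = AutoModel.from_pretrained("bert-base-uncased")  # fully bidirectional attention

def embed(texts):
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = encoder(**batch).last_hidden_state       # (B, T, H)
    mask = batch["attention_mask"].unsqueeze(-1).float()  # (B, T, 1)
    # Mean-pool over real tokens only, ignoring padding positions.
    emb = (hidden * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1e-9)
    return F.normalize(emb, dim=-1)                       # fixed-length embeddings
```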

📝 Abstract
In this paper, we propose KaLM-Embedding-V2, a versatile and compact embedding model, which achieves impressive performance in general-purpose text embedding tasks by leveraging superior training techniques and data. Our key innovations include: (1) To better align the architecture with representation learning, we remove the causal attention mask and adopt a fully bidirectional transformer with simple yet effective mean-pooling to produce fixed-length embeddings; (2) We employ a multi-stage training pipeline: (i) pre-training on large-scale weakly supervised open-source corpora; (ii) fine-tuning on high-quality retrieval and non-retrieval datasets; and (iii) model-soup parameter averaging for robust generalization. In addition, we introduce a focal-style reweighting mechanism that concentrates learning on difficult samples and an online hard-negative mixing strategy to continuously enrich hard negatives without expensive offline mining; (3) We collect over 20 categories of data for pre-training and 100 categories of data for fine-tuning, to boost both the performance and generalization of the embedding model. Extensive evaluations on the Chinese and English Massive Text Embedding Benchmark (MTEB) show that our model significantly outperforms others of comparable size, and competes with 3x, 14x, 18x, and 26x larger embedding models, setting a new standard for a versatile and compact embedding model with under 1B parameters.
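To illustrate the loss-side ideas, below is a hedged sketch of a focal-style reweighted InfoNCE objective, plus one plausible reading of online hard-negative mixing. The weighting function, gamma, temperature, and the top-k mixing heuristic are assumptions for illustration, not the paper's exact formulation:

```python
import torch
import torch.nn.functional as F

def focal_infonce(q, p, negs, temperature=0.05, gamma=2.0):
    """q: (B, H) queries; p: (B, H) positives; negs: (B, K, H) hard negatives.
    All inputs are assumed L2-normalized."""
    pos_sim = (q * p).sum(dim=-1, keepdim=True)        # (B, 1) query-positive similarity
    neg_sim = torch.einsum("bh,bkh->bk", q, negs)      # (B, K) query-negative similarities
    logits = torch.cat([pos_sim, neg_sim], dim=1) / temperature
    log_probs = F.log_softmax(logits, dim=1)
    p_pos = log_probs[:, 0].exp()                      # probability mass on the positive
    # Focal-style weight: well-separated (easy) pairs are down-weighted,
    # so the gradient signal concentrates on difficult samples.
    weights = (1.0 - p_pos).pow(gamma).detach()
    return -(weights * log_probs[:, 0]).mean()

def mix_online_hard_negatives(q, pool, k=4):
    """Pick each query's k most similar pool embeddings as extra hard negatives,
    avoiding offline mining. Filtering out a query's own positive is omitted
    here for brevity."""
    sims = q @ pool.T                                  # (B, N)
    idx = sims.topk(k, dim=1).indices                  # (B, k)
    return pool[idx]                                   # (B, k, H)
```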
Problem

Research questions and friction points this paper is trying to address.

Develop a versatile compact embedding model for text tasks
Enhance performance via advanced training techniques and data
Achieve robust generalization with multi-stage training pipeline
Innovation

Methods, ideas, or system contributions that make the work stand out.

Bidirectional transformer with mean-pooling
Multi-stage training pipeline with model-soup parameter averaging (sketched after this list)
Focal-style reweighting mechanism
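Since the pipeline's final stage is model-soup parameter averaging, here is a minimal uniform-soup sketch over fine-tuned checkpoints; the paths are hypothetical, and the paper may use a weighted or greedy variant rather than a uniform average:

```python
import torch

def uniform_soup(checkpoint_paths):
    """Element-wise average of several fine-tuned checkpoints (uniform soup)."""
    soup = None
    for path in checkpoint_paths:
        state = torch.load(path, map_location="cpu")
        if soup is None:
            soup = {k: v.clone().float() for k, v in state.items()}
        else:
            for k in soup:
                soup[k] += state[k].float()
    return {k: v / len(checkpoint_paths) for k, v in soup.items()}

# Hypothetical usage:
# model.load_state_dict(uniform_soup(["ft_run0.pt", "ft_run1.pt", "ft_run2.pt"]))
```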
👥 Authors
Xinping Zhao
Harbin Institute of Technology (Shenzhen), Shenzhen, China
Xinshuo Hu
Harbin Institute of Technology, Shenzhen
Large Language Model, Text Generation, Truthfulness
Zifei Shan
Applied Research at Tencent
machine learning, natural language processing, language models, knowledge graphs
Shouzheng Huang
Harbin Institute of Technology (Shenzhen), Shenzhen, China
Yao Zhou
Harbin Institute of Technology (Shenzhen), Shenzhen, China
Zetian Sun
Harbin Institute of Technology (Shenzhen), Shenzhen, China
Zhenyu Liu
Harbin Institute of Technology (Shenzhen), Shenzhen, China
Dongfang Li
Harbin Institute of Technology (Shenzhen), Shenzhen, China
Xinyuan Wei
Harbin Institute of Technology (Shenzhen), Shenzhen, China
Qian Chen
Harbin Institute of Technology (Shenzhen), Shenzhen, China
Youcheng Pan
Pengcheng Laboratory, Shenzhen, China
Yang Xiang
Pengcheng Laboratory, Shenzhen, China
Meishan Zhang
Associate Professor, Harbin Institute of Technology at Shenzhen
Natural Language Processing, Computational Linguistics, Syntax Parsing, Sentiment Analysis, Machine
Haofen Wang
Tongji University
Knowledge Graph, Natural Language Processing, Retrieval Augmented Generation
Jun Yu
Harbin Institute of Technology (Shenzhen), Shenzhen, China
Baotian Hu
Harbin Institute of Technology (Shenzhen)
LLM, MLLM, NLP
Min Zhang
Harbin Institute of Technology (Shenzhen), Shenzhen, China