KaLM-Embedding: Superior Training Data Brings A Stronger Embedding Model

📅 2025-01-02
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing general-purpose embedding models often generalize poorly across languages because their training data is noisy and insufficiently diverse. To address this, we focus on improving data quality through three techniques: (1) persona-based controllable synthetic data generation distilled from LLMs; (2) ranking consistency filtering to remove noisy or uninformative samples; and (3) semi-homogeneous task batch sampling to improve training efficacy. Departing from BERT-like encoders, we adapt the Qwen2-0.5B autoregressive language model as the backbone, yielding a lightweight (<1B-parameter) multilingual embedding model that combines LLM distillation, data filtering, and task-aware batching. On the multilingual MTEB benchmark, the model outperforms existing models of comparable size, setting a new state of the art among sub-1B-parameter embedding models.
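To make the ranking consistency filtering concrete, here is a minimal Python sketch of one plausible reading: a pre-trained scoring model embeds the query, the labeled positive, and the mined negatives, and the sample is kept only if the positive ranks above the negatives. The scorer choice, threshold, and function names are illustrative assumptions, not the authors' implementation.

```python
from sentence_transformers import SentenceTransformer

# Hypothetical scorer: any off-the-shelf embedding model could play this role.
scorer = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

def keep_sample(query, positive, negatives, top_k=1):
    """Keep a (query, positive, negatives) triple only if the scorer ranks the
    positive within the top_k passages; this is a rough reading of ranking
    consistency filtering, where pairs whose positive is outranked get dropped."""
    texts = [positive] + list(negatives)
    q_emb = scorer.encode([query], normalize_embeddings=True)
    t_emb = scorer.encode(texts, normalize_embeddings=True)
    sims = (q_emb @ t_emb.T).ravel()                 # cosine similarities
    rank_of_positive = int((sims > sims[0]).sum())   # 0 means best-ranked
    return rank_of_positive < top_k

# Example: a clean pair passes; a mislabeled one would be filtered out.
print(keep_sample(
    query="capital of France",
    positive="Paris is the capital and largest city of France.",
    negatives=["Berlin is the capital of Germany.",
               "France is a country in Western Europe."],
))
```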

📝 Abstract
As retrieval-augmented generation prevails in large language models, embedding models are becoming increasingly crucial. Despite the growing number of general embedding models, prior work often overlooks the critical role of training data quality. In this work, we introduce KaLM-Embedding, a general multilingual embedding model that leverages a large quantity of cleaner, more diverse, and domain-specific training data. Our model has been trained with key techniques proven to enhance performance: (1) persona-based synthetic data to create diversified examples distilled from LLMs, (2) ranking consistency filtering to remove less informative samples, and (3) semi-homogeneous task batch sampling to improve training efficacy. Departing from traditional BERT-like architectures, we adopt Qwen2-0.5B as the pre-trained model, facilitating the adaptation of auto-regressive language models for general embedding tasks. Extensive evaluations on the MTEB benchmark across multiple languages show that our model outperforms others of comparable size, setting a new standard for multilingual embedding models with <1B parameters.
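The semi-homogeneous task batch sampling mentioned in the abstract can be pictured roughly as follows: each batch draws most of its examples from a single task, so in-batch negatives stay informative, and fills the remainder from the full task mixture, so batches are not fully homogeneous. The 0.75 ratio, field names, and generator interface below are assumptions for illustration rather than the paper's exact procedure.

```python
import random
from collections import defaultdict

def semi_homogeneous_batches(samples, batch_size=32, homogeneous_ratio=0.75, seed=0):
    """Yield batches in which roughly `homogeneous_ratio` of the examples share
    one task and the rest come from the full mixture (illustrative ratio)."""
    rng = random.Random(seed)
    by_task = defaultdict(list)
    for sample in samples:
        by_task[sample["task"]].append(sample)

    tasks = list(by_task)
    n_same = int(batch_size * homogeneous_ratio)
    while True:
        task = rng.choice(tasks)                                # anchor task for this batch
        same = rng.sample(by_task[task], min(n_same, len(by_task[task])))
        mixed = rng.sample(samples, batch_size - len(same))     # filler from all tasks
        batch = same + mixed
        rng.shuffle(batch)
        yield batch

# Toy usage: three tasks, 100 records each.
toy = [{"task": t, "text": f"{t}-{i}"}
       for t in ("retrieval", "sts", "classification") for i in range(100)]
print([ex["task"] for ex in next(semi_homogeneous_batches(toy))][:10])
```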
Problem

Research questions and friction points this paper is trying to address.

Generic Embedding Models
Training Data Quality
Model Performance Limitations
Innovation

Methods, ideas, or system contributions that make the work stand out.

KaLM-Embedding
Qwen2-0.5B
MTEB Benchmark
Xinshuo Hu
Harbin Institute of Technology, Shenzhen
Large Language Model, Text Generation, Truthfulness
Zifei Shan
Applied Research at Tencent
machine learning, natural language processing, language models, knowledge graphs
Xinping Zhao
Harbin Institute of Technology (Shenzhen), Shenzhen, China
Zetian Sun
Harbin Institute of Technology (Shenzhen), Shenzhen, China
Zhenyu Liu
Harbin Institute of Technology (Shenzhen), Shenzhen, China
Dongfang Li
Harbin Institute of Technology (Shenzhen), Shenzhen, China
Shaolin Ye
Harbin Institute of Technology (Shenzhen), Shenzhen, China
Xinyuan Wei
Harbin Institute of Technology (Shenzhen), Shenzhen, China
Qian Chen
Harbin Institute of Technology (Shenzhen), Shenzhen, China
Baotian Hu
Harbin Institute of Technology (Shenzhen)
LLM, MLLM, NLP
Min Zhang
Harbin Institute of Technology (Shenzhen), Shenzhen, China