QZhou-Embedding Technical Report

📅 2025-08-29

🤖 AI Summary

To address the limited generalization and poor multi-task adaptability of existing text embedding models, this paper introduces QZhou-Embedding, a general-purpose contextual embedding model built upon Qwen2.5-7B-Instruct. The authors design a unified multi-task learning framework integrated with an LLM-driven data synthesis pipeline (paraphrasing, positive example augmentation, and hard negative generation) and adopt a two-stage training strategy: retrieval-focused pretraining followed by full-task fine-tuning. The key innovation lies in adapting instruction tuning to embedding modeling, enabling explicit learning of context-aware semantic representations. On the MTEB and CMTEB benchmarks, QZhou-Embedding ranks first on both leaderboards as of August 27, 2025, and achieves state-of-the-art results on tasks including reranking and clustering. The model weights are released on HuggingFace under the Apache 2.0 license, and evaluation code and instructions are provided on GitHub for reproducibility.

📝 Abstract

We present QZhou-Embedding, a general-purpose contextual text embedding model with exceptional text representation capabilities. Building upon the Qwen2.5-7B-Instruct foundation model, we designed a unified multi-task framework comprising specialized data transformation and training strategies. The data transformation scheme enables the incorporation of more diverse textual training datasets, while the task-specific training strategies enhance model learning efficiency. We developed a data synthesis pipeline that leverages an LLM API and incorporates techniques such as paraphrasing, augmentation, and hard negative example generation to improve the semantic richness and sample difficulty of the training set. Additionally, we employ a two-stage training strategy, comprising initial retrieval-focused pretraining followed by full-task fine-tuning, which lets the embedding model extend its capabilities from a base of robust retrieval performance. Our model achieves state-of-the-art results on the MTEB and CMTEB benchmarks, ranking first on both leaderboards (as of August 27, 2025), and also achieves state-of-the-art performance on tasks such as reranking and clustering. Our findings demonstrate that higher-quality, more diverse data is crucial for advancing retrieval model performance, and that leveraging LLMs' generative capabilities can further optimize data quality for embedding model breakthroughs. Our model weights are released on HuggingFace under the Apache 2.0 license. For reproducibility, we provide evaluation code and instructions on GitHub.
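To illustrate why hard negatives raise "sample difficulty", here is a minimal sketch using an InfoNCE-style contrastive loss. The abstract does not name the training objective, so the loss, the temperature value, and the toy vectors below are all assumptions chosen for illustration, not the authors' implementation.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce_loss(query, positive, negatives, temperature=0.05):
    """InfoNCE-style loss (assumed objective): -log softmax of the positive.

    The positive is scored against every negative; the loss is small when
    the positive is far more similar to the query than any negative.
    """
    logits = [cosine(query, positive) / temperature]
    logits += [cosine(query, n) / temperature for n in negatives]
    # Numerically stable log-sum-exp over all candidates.
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum_exp - logits[0]


# Toy 2-D "embeddings" (hypothetical values).
query = [1.0, 0.0]
positive = [0.9, 0.1]
easy_negative = [-1.0, 0.0]  # clearly unrelated to the query
hard_negative = [0.8, 0.3]   # close to the query, but not the right answer

easy_loss = info_nce_loss(query, positive, [easy_negative])
hard_loss = info_nce_loss(query, positive, [hard_negative])
print(f"easy negative loss: {easy_loss:.4f}")
print(f"hard negative loss: {hard_loss:.4f}")
```

Under these assumptions the easy negative contributes almost no loss, so it carries little training signal, while the hard negative produces a clearly larger loss. That is the usual motivation for generating hard negatives.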

Problem

Research questions and friction points this paper is trying to address.

Develops a general-purpose contextual text embedding model

Enhances training with diverse data synthesis techniques

Improves retrieval and semantic representation capabilities

Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-task framework with specialized data transformation

LLM API pipeline for data synthesis and augmentation

Two-stage training strategy with retrieval pretraining
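The two-stage strategy can be sketched as a data-selection schedule: stage 1 sees only retrieval data, and stage 2 sees every task. The dataset names, task labels, and the `datasets_for_stage` helper are hypothetical; the source does not specify how the stages are configured.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Dataset:
    name: str  # hypothetical dataset identifier
    task: str  # hypothetical task label, e.g. "retrieval" or "clustering"


def datasets_for_stage(stage, datasets):
    """Select the training data for a given stage.

    Stage 1: retrieval-focused pretraining, retrieval data only.
    Stage 2: full-task fine-tuning, all available tasks.
    """
    if stage == 1:
        return [d for d in datasets if d.task == "retrieval"]
    if stage == 2:
        return list(datasets)
    raise ValueError(f"unknown stage: {stage}")


# Placeholder corpus mix, for illustration only.
corpus = [
    Dataset("web_qa_pairs", "retrieval"),
    Dataset("sentence_pairs", "sts"),
    Dataset("topic_labels", "classification"),
    Dataset("doc_groups", "clustering"),
]

stage1 = datasets_for_stage(1, corpus)
stage2 = datasets_for_stage(2, corpus)
print([d.name for d in stage1])
print([d.name for d in stage2])
```

In this sketch the second stage simply widens the data mix; whether retrieval data is re-weighted or sampled differently in stage 2 is not stated in the source.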
