🤖 AI Summary
To address the limited generalization capability and poor multi-task adaptability of existing text embedding models, this paper introduces QZhou-Embedding, a general-purpose contextual embedding model built upon Qwen2.5-7B-Instruct. Methodologically, the authors design a unified multi-task learning framework integrated with an LLM-driven data synthesis pipeline (semantic rewriting, positive example augmentation, and hard negative generation) and adopt a two-stage training strategy: retrieval pre-training followed by full-task fine-tuning. The key innovation lies in adapting instruction tuning to embedding modeling, enabling explicit learning of context-aware semantic representations. Evaluated on the MTEB and CMTEB benchmarks, QZhou-Embedding ranks first on both leaderboards (as of August 27, 2025) and sets new state-of-the-art results on subtasks including re-ranking and clustering. The model weights are released on HuggingFace under the Apache 2.0 license, and evaluation code and instructions are provided on GitHub for reproducibility.
📝 Abstract
We present QZhou-Embedding, a general-purpose contextual text embedding model with exceptional text representation capabilities. Building on the Qwen2.5-7B-Instruct foundation model, we design a unified multi-task framework comprising specialized data transformation and training strategies. The data transformation scheme enables the incorporation of more diverse textual training datasets, while the task-specific training strategies improve the model's learning efficiency. We develop a data synthesis pipeline that leverages LLM APIs and incorporates techniques such as paraphrasing, augmentation, and hard negative example generation to improve the semantic richness and sample difficulty of the training set. Additionally, we employ a two-stage training strategy, with initial retrieval-focused pretraining followed by full-task fine-tuning, so that the embedding model extends its capabilities on top of robust retrieval performance. Our model achieves state-of-the-art results on the MTEB and CMTEB benchmarks, ranking first on both leaderboards (as of August 27, 2025), and also achieves state-of-the-art performance on individual tasks including reranking and clustering. Our findings demonstrate that higher-quality, more diverse data is crucial for advancing retrieval model performance, and that leveraging LLMs' generative capabilities can further improve data quality for embedding models. Our model weights are released on HuggingFace under the Apache 2.0 license. For reproducibility, we provide evaluation code and instructions on GitHub.
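The abstract does not state the training objective. As a minimal sketch, assuming an InfoNCE-style contrastive loss (a common choice for embedding models, not confirmed here as the QZhou-Embedding objective), the following illustrates why hard negatives increase sample difficulty: a negative that lies close to the query yields a larger loss than an easy one. The function names, the toy vectors, and the temperature of 0.05 are all hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce_loss(query, positive, negatives, temperature=0.05):
    """InfoNCE loss for one query, one positive, and a list of negatives.

    Assumed objective for illustration only; the temperature value is
    a placeholder, not a setting taken from the paper.
    """
    # The positive is placed at index 0 of the logits.
    logits = [cosine(query, positive) / temperature]
    logits += [cosine(query, neg) / temperature for neg in negatives]
    # Numerically stable log-sum-exp over all candidates.
    max_logit = max(logits)
    log_denominator = max_logit + math.log(
        sum(math.exp(x - max_logit) for x in logits)
    )
    # Negative log-probability of picking the positive.
    return log_denominator - logits[0]


# Toy 2-D embeddings (hypothetical values).
query = [1.0, 0.0]
positive = [0.9, 0.1]
easy_negative = [-1.0, 0.0]   # points away from the query
hard_negative = [0.8, 0.3]    # nearly as close to the query as the positive

loss_easy = info_nce_loss(query, positive, [easy_negative])
loss_hard = info_nce_loss(query, positive, [hard_negative])
print(f"easy negative loss: {loss_easy:.4f}")
print(f"hard negative loss: {loss_hard:.4f}")
```

Under this assumed objective, the easy negative contributes almost nothing to the loss, while the hard negative leaves a clearly nonzero loss and therefore a useful gradient signal.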