Towards A Generalist Code Embedding Model Based On Massive Data Synthesis

📅 2025-05-19
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing code embedding models suffer from scarce high-quality semantic annotations, resulting in poor generalization and inadequate support for universal code retrieval. Method: We propose CodeR, a novel model that introduces the DRU (Diversity, Reliability, Usability) principle to guide the construction of a synthetic dataset—CodeR-Pile—and designs an annealing-based curriculum learning strategy to enable knowledge transfer across heterogeneous, multi-source data. CodeR integrates a code synthesis pipeline, the DRU evaluation framework, contrastive learning, and a dual-encoder architecture. Contribution/Results: CodeR achieves state-of-the-art performance across 16 diverse code retrieval benchmarks and demonstrates significantly improved out-of-domain generalization. The model and all code are fully open-sourced, fostering community advancement in code intelligence research.
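The summary mentions contrastive learning over a dual-encoder architecture. A minimal sketch of the standard in-batch-negative InfoNCE objective such models typically optimize is below (illustrative only; function and variable names are hypothetical, not the authors' code, and the paper's exact loss may differ):

```python
import numpy as np

def info_nce_loss(q, d, temperature=0.05):
    """InfoNCE contrastive loss with in-batch negatives.

    q: (B, dim) query embeddings (e.g. natural-language descriptions)
    d: (B, dim) document embeddings (e.g. code snippets); row i of d
       is the positive for row i of q, all other rows act as negatives.
    """
    # Cosine similarity: L2-normalize, then take dot products.
    q = q / np.linalg.norm(q, axis=1, keepdims=True)
    d = d / np.linalg.norm(d, axis=1, keepdims=True)
    logits = q @ d.T / temperature  # (B, B) similarity matrix
    # Cross-entropy with the diagonal (matched pairs) as gold labels.
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))
```

In a dual-encoder setup, `q` and `d` come from two (often weight-shared) encoders; the loss pulls each query toward its paired code snippet and pushes it away from the other snippets in the batch.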

📝 Abstract
Code embedding models attract increasing attention due to the widespread popularity of retrieval-augmented generation (RAG) in software development. These models are expected to capture the rich semantic relationships inherent to code, which differ significantly from those found in text. However, existing models remain severely limited due to the scarcity of high-quality training data. In this work, we introduce **CodeR** (Code Retrieval), a state-of-the-art embedding model for general-purpose code retrieval. The superior performance of CodeR is built upon CodeR-Pile, a large-scale synthetic dataset constructed under the DRU (Diversity, Reliability, Usability) principle via a novel data synthesis pipeline. To optimize training effectiveness, we propose Annealing, a curriculum learning strategy that enables effective knowledge transfer across heterogeneous sources of data. We evaluate CodeR based on 16 diverse code retrieval tasks, where it significantly outperforms existing baselines and exhibits strong out-of-domain generalization performance. We have publicly released our code and the well-trained model to facilitate further research in this critical area. https://github.com/FlagOpen/FlagEmbedding/tree/master/research/BGE_Coder.
Problem

Research questions and friction points this paper is trying to address.

Addressing the scarcity of high-quality training data for code embedding models
Developing a general-purpose code retrieval model with superior performance
Enhancing training effectiveness across heterogeneous data sources
Innovation

Methods, ideas, or system contributions that make the work stand out.

CodeR-Pile, a large-scale synthetic dataset
A novel data synthesis pipeline guided by the DRU (Diversity, Reliability, Usability) principle
Annealing, a curriculum learning strategy for knowledge transfer across heterogeneous data sources
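The paper does not spell out the Annealing schedule here, but a generic curriculum of this kind can be sketched as a sampling mixture whose source weights are interpolated over training, shifting from broad multi-source data toward the target distribution (all names and the linear schedule below are illustrative assumptions, not the authors' recipe):

```python
import random

def annealed_mixture(step, total_steps, sources):
    """Pick a training source with probabilities that anneal over time.

    sources: list of (name, start_weight, end_weight) tuples. Early in
    training the start weights dominate; they are linearly interpolated
    toward the end weights, gradually shifting the mix from heterogeneous
    synthetic data to high-quality target-task data. (Hypothetical
    schedule for illustration; the paper's exact strategy may differ.)
    """
    t = min(step / total_steps, 1.0)  # training progress in [0, 1]
    weights = [(1 - t) * start + t * end for _, start, end in sources]
    names = [name for name, _, _ in sources]
    return random.choices(names, weights=weights, k=1)[0]
```

For example, with sources `[("synthetic", 1.0, 0.0), ("curated", 0.0, 1.0)]`, early batches draw almost entirely from the synthetic pool and late batches from the curated one.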