🤖 AI Summary
Existing code embedding models generalize poorly to complex retrieval tasks such as bug localization in GitHub repositories, largely because their training data is noisy: positive pairs are inconsistent, and negative samples are not hard enough. To address these issues, we propose CoRNStack, a large-scale, multilingual, high-quality contrastive training dataset for code, built with consistency-based filtering to remove noisy positives and a hard negative mining strategy for more effective learning. We also use the dataset to train a code reranking model, an area far less explored than text reranking, and deploy it within a two-stage retrieve-then-rerank architecture. Experiments demonstrate that our approach achieves state-of-the-art performance across multiple code retrieval benchmarks; notably, it improves MRR@10 significantly on GitHub function localization. Moreover, the joint retrieval-and-reranking pipeline substantially outperforms existing methods, setting a new state of the art in code semantic retrieval.
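To make the consistency-based filtering idea concrete, here is a minimal sketch (not the paper's implementation; the function name, `top_k` threshold, and toy embeddings are illustrative assumptions). The intuition: score every text against every candidate code with a seed embedding model, and keep a (text, code) pair only if its own code ranks among the text's top-k matches. Pairs whose text is closer to unrelated code than to its labeled positive are treated as noisy and dropped.

```python
import numpy as np

def consistency_filter(text_emb: np.ndarray, code_emb: np.ndarray, top_k: int = 2) -> np.ndarray:
    """Keep pair i only if code_i ranks within the top_k most similar
    codes for text_i among all candidate codes (hypothetical sketch)."""
    # Cosine similarity: L2-normalize rows, then take the dot product.
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    c = code_emb / np.linalg.norm(code_emb, axis=1, keepdims=True)
    sim = t @ c.T                                # sim[i, j] = similarity(text_i, code_j)
    # Rank of the labeled positive: count candidates scoring >= the true pair.
    paired = np.diag(sim)
    rank = (sim >= paired[:, None]).sum(axis=1)  # rank 1 = best match
    return rank <= top_k                         # boolean keep-mask over pairs

# Toy example with 2-D embeddings: pair 2 is "noisy" because its text
# is closer to the other codes than to its own labeled code.
texts = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
codes = np.array([[0.9, 0.1], [0.1, 0.9], [-1.0, 0.2]])
mask = consistency_filter(texts, codes, top_k=1)
```

With `top_k=1`, the first two pairs survive and the third is filtered out, since its text ranks both other codes above its own positive.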
📝 Abstract
Effective code retrieval plays a crucial role in advancing code generation, bug fixing, and software maintenance, particularly as software systems increase in complexity. While current code embedding models have demonstrated promise in retrieving code snippets for small-scale, well-defined tasks, they often underperform in more demanding real-world applications such as bug localization within GitHub repositories. We hypothesize that a key issue is their reliance on noisy and inconsistent datasets for training, which impedes their ability to generalize to more complex retrieval scenarios. To address these limitations, we introduce CoRNStack, a large-scale, high-quality contrastive training dataset for code that spans multiple programming languages. This dataset is curated using consistency filtering to eliminate noisy positives and is further enriched with mined hard negatives, thereby facilitating more effective learning. We demonstrate that contrastive training of embedding models using CoRNStack leads to state-of-the-art performance across a variety of code retrieval tasks. Furthermore, the dataset can be leveraged for training code reranking models, a largely underexplored area compared to text reranking. Our finetuned code reranking model significantly improves the ranking quality over the retrieved results. Finally, by employing our code retriever and reranker together, we demonstrate significant improvements in function localization for GitHub issues, an important component of real-world software development.
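The abstract's "contrastive training ... enriched with mined hard negatives" can be sketched as an InfoNCE-style loss; this is a generic illustration under assumed names and a made-up temperature, not CoRNStack's exact training objective. The query embedding is pulled toward its positive code and pushed away from the mined negatives; harder (more similar) negatives yield a larger loss and hence a stronger learning signal.

```python
import numpy as np

def info_nce_loss(q, pos, hard_negs, temperature: float = 0.05) -> float:
    """InfoNCE-style contrastive loss for one query embedding `q`, its
    positive code embedding `pos`, and a list of mined hard negatives."""
    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
    # Positive score at index 0, negatives after it; sharpen with temperature.
    scores = np.array([cos(q, pos)] + [cos(q, n) for n in hard_negs]) / temperature
    # Cross-entropy against index 0, via a numerically stable log-sum-exp.
    m = scores.max()
    log_probs = scores - (m + np.log(np.exp(scores - m).sum()))
    return -log_probs[0]

# Toy embeddings: an easy (dissimilar) negative vs. a hard (similar) one.
q = np.array([1.0, 0.0])
pos = np.array([0.99, 0.1])
loss_easy = info_nce_loss(q, pos, [np.array([0.0, 1.0])])
loss_hard = info_nce_loss(q, pos, [np.array([0.95, 0.2])])
```

Here `loss_hard > loss_easy`: the near-duplicate negative is barely separated from the positive, which is exactly why mining such negatives makes training more informative than sampling random ones.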