CoQuIR: A Comprehensive Benchmark for Code Quality-Aware Information Retrieval

📅 2025-05-31
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing code retrieval benchmarks overemphasize functional relevance while neglecting critical quality dimensions—correctness, efficiency, security, and maintainability. This work introduces CoQuIR, the first large-scale, multilingual, quality-aware code retrieval benchmark, covering 11 programming languages, 42K natural-language queries, and 135K code snippets, with fine-grained quality annotations across multiple dimensions. We formally define the role of code quality in retrieval and propose two novel quality-oriented evaluation metrics: Pairwise Preference Accuracy and Margin-based Ranking Score. Leveraging a multilingual quality annotation framework, synthetic perturbation-based data augmentation, and quality-aware fine-tuning, we conduct comprehensive benchmarking across 23 models. Results show that quality-aware training significantly improves quality-oriented metrics without degrading semantic relevance. Furthermore, downstream code generation experiments demonstrate enhanced reliability of generated outputs, validating CoQuIR’s utility for building robust, production-ready code intelligence systems.

📝 Abstract
Code retrieval is essential in modern software development, as it boosts code reuse and accelerates debugging. However, current benchmarks primarily emphasize functional relevance while neglecting critical dimensions of software quality. Motivated by this gap, we introduce CoQuIR, the first large-scale, multilingual benchmark specifically designed to evaluate quality-aware code retrieval across four key dimensions: correctness, efficiency, security, and maintainability. CoQuIR provides fine-grained quality annotations for 42,725 queries and 134,907 code snippets in 11 programming languages, and is accompanied by two quality-centric evaluation metrics: Pairwise Preference Accuracy and Margin-based Ranking Score. Using CoQuIR, we benchmark 23 retrieval models, covering both open-source and proprietary systems, and find that even top-performing models frequently fail to distinguish buggy or insecure code from their more robust counterparts. Furthermore, we conduct preliminary investigations into training methods that explicitly encourage retrievers to recognize code quality. Using synthetic datasets, we demonstrate promising improvements in quality-aware metrics across various models, without sacrificing semantic relevance. Downstream code generation experiments further validate the effectiveness of our approach. Overall, our work highlights the importance of integrating quality signals into code retrieval systems, laying the groundwork for more trustworthy and robust software development tools.
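The abstract names two quality-centric metrics but does not spell out their formulas here. As an illustrative sketch only (the paper's exact formulation may differ), Pairwise Preference Accuracy can be read as the fraction of annotated pairs where the retriever scores the high-quality snippet above its low-quality counterpart, and Margin-based Ranking Score as the average score gap between the two:

```python
def pairwise_preference_accuracy(scores_high, scores_low):
    """Fraction of (high-quality, low-quality) snippet pairs where the
    retriever assigns the higher score to the high-quality snippet.
    Illustrative reading of the metric, not the paper's exact definition."""
    pairs = list(zip(scores_high, scores_low))
    return sum(h > l for h, l in pairs) / len(pairs)

def margin_based_ranking_score(scores_high, scores_low):
    """Average score margin between high- and low-quality snippets;
    positive values mean the retriever tends to prefer higher-quality code."""
    pairs = list(zip(scores_high, scores_low))
    return sum(h - l for h, l in pairs) / len(pairs)
```

Under this reading, a retriever that is blind to quality (scoring both variants of a pair identically) gets a preference accuracy of 0 and a margin of 0, which matches the paper's finding that even top models often fail to separate buggy or insecure code from robust counterparts.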
Problem

Research questions and friction points this paper is trying to address.

Evaluates code retrieval models on quality dimensions
Benchmarks 23 models across correctness, efficiency, security, maintainability
Addresses neglect of software quality in existing benchmarks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multilingual benchmark for quality-aware code retrieval
Quality-centric evaluation metrics for code assessment
Training methods improving quality recognition without relevance loss
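One plausible shape for such quality-aware training (a sketch under assumptions; the paper's actual fine-tuning recipe uses synthetic perturbation-based augmentation, and the loss below is a generic stand-in) is a triplet-style hinge loss that pushes a query embedding toward the high-quality snippet and away from a synthetically perturbed low-quality variant:

```python
def dot(u, v):
    """Similarity between two embedding vectors (plain dot product here)."""
    return sum(a * b for a, b in zip(u, v))

def quality_triplet_loss(query, pos, neg, margin=0.2):
    """Hinge loss: the query should score the high-quality snippet (pos)
    at least `margin` higher than its perturbed low-quality variant (neg).
    `margin=0.2` is an arbitrary illustrative value, not from the paper."""
    return max(0.0, margin - (dot(query, pos) - dot(query, neg)))
```

Because the negative is a quality-degraded variant of a still-relevant snippet (rather than an unrelated document), this kind of objective can sharpen quality preferences without disturbing the semantic-relevance signal learned from ordinary query-snippet pairs.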
Jiahui Geng
Mohamed bin Zayed University of Artificial Intelligence
Artificial Intelligence · Natural Language Processing
Fengyu Cai
TU Darmstadt
Natural Language Processing · Information Retrieval
Shaobo Cui
École Polytechnique Fédérale de Lausanne (EPFL), Switzerland
Qing Li
Mohamed bin Zayed University of Artificial Intelligence (MBZUAI), UAE
Liangwei Chen
Google Tokyo, Japan
Chenyang Lyu
Alibaba
Large Language Models · Natural Language Processing · Machine Learning
Haonan Li
Mohamed bin Zayed University of Artificial Intelligence (MBZUAI), UAE
Derui Zhu
Technical University of Munich
NLP · Privacy and Security · Machine Learning · Software Engineering
Walter Pretschner
Technical University of Munich (TUM), Germany
Heinz Koeppl
Technische Universität Darmstadt, Dept. Electrical Engineering and Dept. Biology
synthetic biology · machine learning · multi-agent systems · self-organization · collective intelligence
Fakhri Karray
Mohamed bin Zayed University of Artificial Intelligence (MBZUAI), UAE