🤖 AI Summary
This paper addresses the coarse-grained file relevance modeling and insufficient contextual exploitation in warehouse-scale code search—particularly for defect repair. We propose the first multi-stage re-ranking framework tailored for code repositories. Methodologically, it begins with BM25 retrieval using commit messages, followed by a semantic-driven neural re-ranker based on CodeBERT. Crucially, we introduce fine-grained file-level relevance modeling via dual-modality matching between source code and historical commit messages, and explicitly leverage large-scale open-source project commit histories to enhance contextual awareness. Evaluated on a new benchmark comprising seven mainstream open-source repositories, our approach achieves up to an 80% improvement over BM25 in MAP, MRR, and P@1. These gains significantly boost the code understanding and retrieval accuracy of LLM-based agents.
📝 Abstract
This paper presents a multi-stage reranking system for repository-level code search, which leverages the vastly available commit histories of large open-source repositories to aid in bug fixing. We define the task of repository-level code search as retrieving the set of files from the current state of a code repository that are most relevant to addressing a user's question or bug. The proposed approach combines BM25-based retrieval over commit messages with neural reranking using CodeBERT to identify the most pertinent files. By learning patterns from diverse repositories and their commit histories, the system can surface relevant files for the task at hand. The system leverages both commit messages and source code for relevance matching, and is evaluated in both normal and oracle settings. Experiments on a new dataset created from 7 popular open-source repositories demonstrate substantial improvements of up to 80% in MAP, MRR and P@1 over the BM25 baseline, across a diverse set of queries, demonstrating the effectiveness this approach. We hope this work aids LLM agents as a tool for better code search and understanding. Our code and results obtained are publicly available.