🤖 AI Summary
Precise localization of software bugs requires mapping verbose, ambiguous natural-language bug reports to specific code locations; however, existing approaches suffer from high latency, prohibitive cost (reliance on proprietary large language models), or poor retrieval adaptability (conventional code ranking models are not optimized for bug localization). Method: We propose a lightweight, efficient retrieve-then-rerank framework featuring (i) the first two-stage paradigm explicitly optimized for bug localization; (ii) SweLoc, the first large-scale, real-world GitHub issue-patch dataset; and (iii) an open-model stack comprising a Contriever dense retriever and a fine-tuned cross-encoder reranker. Contribution/Results: Our method achieves state-of-the-art performance on SWE-Bench-Lite and LocBench, significantly outperforming agent-based systems built on closed-source LLMs (e.g., Claude-3.5) as well as mainstream code retrieval models, while relying solely on open-source models and incurring substantially lower computational overhead.
📝 Abstract
Software issue localization, the task of identifying the precise code locations (files, classes, or functions) relevant to a natural-language issue description (e.g., a bug report or feature request), is a critical yet time-consuming aspect of software development. While recent LLM-based agentic approaches show promise, they often incur significant latency and cost due to complex multi-step reasoning and their reliance on closed-source LLMs. Conversely, traditional code ranking models, typically optimized for query-to-code or code-to-code retrieval, struggle with the verbose, failure-descriptive nature of issue localization queries. To bridge this gap, we introduce SweRank, an efficient and effective retrieve-and-rerank framework for software issue localization. To facilitate training, we construct SweLoc, a large-scale dataset curated from public GitHub repositories that pairs real-world issue descriptions with their corresponding code modifications. Empirical results on SWE-Bench-Lite and LocBench show that SweRank achieves state-of-the-art performance, outperforming both prior ranking models and costly agent-based systems built on closed-source LLMs such as Claude-3.5. Further, we demonstrate SweLoc's utility in enhancing a range of existing retriever and reranker models for issue localization, establishing the dataset as a valuable resource for the community.
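The two-stage control flow described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the `embed` and `rerank_score` functions below are hypothetical bag-of-words stand-ins for the actual Contriever dense retriever and fine-tuned cross-encoder, and `localize` only shows how a cheap retrieval stage shortlists candidates before a costlier reranking stage orders them.

```python
from collections import Counter
import math

def embed(text):
    # Hypothetical stand-in for a dense retriever such as Contriever:
    # a bag-of-words count vector (real systems use transformer embeddings).
    return Counter(text.lower().split())

def cosine(a, b):
    # Cosine similarity between two sparse count vectors.
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def rerank_score(query, code):
    # Hypothetical stand-in for the fine-tuned cross-encoder reranker:
    # scores the (query, code) pair jointly (here, token Jaccard overlap)
    # rather than comparing independently computed embeddings.
    q, c = set(query.lower().split()), set(code.lower().split())
    return len(q & c) / len(q | c) if q | c else 0.0

def localize(issue, functions, k=2):
    # Stage 1 (retrieve): embed the issue once and cheaply shortlist
    # the top-k candidate functions by embedding similarity.
    q_emb = embed(issue)
    candidates = sorted(functions,
                        key=lambda f: cosine(q_emb, embed(f)),
                        reverse=True)[:k]
    # Stage 2 (rerank): apply the more expensive pairwise scorer
    # only to the shortlist, and return it in refined order.
    return sorted(candidates, key=lambda f: rerank_score(issue, f),
                  reverse=True)
```

For example, given the issue "crash when parsing empty config file" and a list of function snippets, `localize` would shortlist by token overlap and then rerank, surfacing a `parse_config`-like snippet first. The design point is that retrieval cost scales with corpus size while reranking cost scales only with `k`, which is what makes the two-stage paradigm cheap relative to agentic multi-step reasoning.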