Gotta catch 'em all! Towards File Localisation from Issues at Large

📅 2025-07-24

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This paper addresses file localization for **all types of issue reports**, not only bugs, in software repositories, in contrast to traditional bug-centric approaches. We provide a data pipeline for building issue file localization datasets that copes with arbitrary branching and merging practices, together with a baseline evaluation using traditional information retrieval approaches. The results show that methods built on bug-specific heuristics perform poorly on general issue types; that performance differs between issue types by a small but statistically significant margin; and that the presence of identifiers has a small effect for most issue types. Many results are project-dependent. Together, these findings point to a need for general-purpose models that can be tuned to project-specific characteristics.

📝 Abstract

Bug localisation, the study of methods for localising the files that must change to resolve a bug, has long been researched with the aim of saving developers' time. Recently, researchers have started to consider issues other than bugs. Nevertheless, most existing research into file localisation from issues focusses on bugs, or uses other selection methods to ensure that only certain types of issues are considered. Our goal is to work on all issues at large, without any specific selection. In this work, we provide a data pipeline for the creation of issue file localisation datasets, capable of dealing with arbitrary branching and merging practices. We provide a baseline performance evaluation for the file localisation problem using traditional information retrieval approaches. Finally, we use statistical analysis to investigate the influence of biases known in the bug localisation community on our dataset. Our results show that methods designed using bug-specific heuristics perform poorly on general issue types, indicating a need for research into general-purpose models. Furthermore, we find that there are small but statistically significant differences in performance between different issue types. Finally, we find that the presence of identifiers has a small effect on performance for most issue types. Many results are project-dependent, encouraging the development of methods which can be tuned to project-specific characteristics.
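
To make the idea of an information retrieval baseline concrete, here is a minimal sketch of ranking files against an issue's text with TF-IDF and cosine similarity. It is an illustration only: the abstract does not say which retrieval models, tokenisation, or weighting the authors used, so the function names (`tokenize`, `rank_files`), the camelCase splitting, and the smoothed IDF formula are all assumptions.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Split camelCase identifiers, lowercase, and split on non-alphanumerics."""
    # Assumption: identifier splitting is a common preprocessing step,
    # not something the source confirms for this paper.
    text = re.sub(r"([a-z0-9])([A-Z])", r"\1 \2", text)
    return [t for t in re.split(r"[^A-Za-z0-9]+", text.lower()) if t]


def rank_files(issue_text, files):
    """Return file paths ordered by TF-IDF cosine similarity to the issue text.

    `files` maps a file path to that file's textual content.
    """
    docs = {path: Counter(tokenize(content)) for path, content in files.items()}
    n_docs = len(docs)

    # Document frequency: in how many files does each term occur?
    doc_freq = Counter()
    for counts in docs.values():
        doc_freq.update(counts.keys())

    # Smoothed inverse document frequency (one common variant, assumed here).
    idf = {t: math.log((1 + n_docs) / (1 + d)) + 1.0 for t, d in doc_freq.items()}

    def weigh(counts):
        # Terms that never occur in any file carry no weight.
        return {t: c * idf[t] for t, c in counts.items() if t in idf}

    def cosine(a, b):
        dot = sum(w * b.get(t, 0.0) for t, w in a.items())
        norm_a = math.sqrt(sum(w * w for w in a.values()))
        norm_b = math.sqrt(sum(w * w for w in b.values()))
        return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

    query = weigh(Counter(tokenize(issue_text)))
    scores = {path: cosine(query, weigh(counts)) for path, counts in docs.items()}
    # Highest score first; ties are broken alphabetically for determinism.
    return sorted(scores, key=lambda p: (-scores[p], p))
```

Given an issue such as "Password validation fails on the login form" and a small set of files, the file whose identifiers share terms with the issue (for instance a hypothetical `src/LoginForm.java`) would be ranked first. A baseline of this kind relies on lexical overlap alone and uses no bug-specific heuristics.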

Problem

Research questions and friction points this paper is trying to address.

Develop file localization methods for all issue types

Create datasets handling arbitrary branching and merging

Evaluate bias effects and performance across issue types

Innovation

Methods, ideas, or system contributions that make the work stand out.

Data pipeline for issue file localisation datasets

Baseline evaluation using information retrieval

Statistical analysis of biases in datasets
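
Comparing performance across issue types requires a ranking metric computed per type. The sketch below groups results by issue type and computes mean reciprocal rank (MRR). This is an assumption made for illustration: the summary does not state which metrics the paper reports, and the issue-type labels in the usage note are hypothetical.

```python
from collections import defaultdict


def reciprocal_rank(ranked_files, changed_files):
    """Return 1/position of the first correctly ranked file, or 0.0 if none is found."""
    for position, path in enumerate(ranked_files, start=1):
        if path in changed_files:
            return 1.0 / position
    return 0.0


def mrr_by_issue_type(results):
    """Compute mean reciprocal rank separately for each issue type.

    `results` is an iterable of (issue_type, ranked_files, changed_files)
    tuples, where `changed_files` are the files actually modified to
    resolve the issue.
    """
    per_type = defaultdict(list)
    for issue_type, ranked_files, changed_files in results:
        per_type[issue_type].append(reciprocal_rank(ranked_files, set(changed_files)))
    return {t: sum(values) / len(values) for t, values in per_type.items()}
```

For example, calling `mrr_by_issue_type` with two "Bug" issues whose first correct file appears at ranks 1 and 2 yields 0.75 for that type. Per-type scores like these are the kind of input a statistical comparison between issue types could work from.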

🔎 Similar Papers

BLAZE: Cross-Language and Cross-Project Bug Localization via Dynamic Chunking and Hard Example Learning