Leveraging Large Language Model for Information Retrieval-based Bug Localization

📅 2025-07-31

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Information retrieval-based bug localization often suffers from low accuracy because bug reports and source code use different vocabulary. To address this, the authors propose GenLoc, which equips a large language model (LLM) with code-exploration functions so that it can iteratively analyze the code base and identify candidate buggy files. Instead of relying on keyword matching alone, GenLoc uses the LLM's code comprehension to navigate the code base, and it may optionally retrieve semantically relevant files using vector embeddings to gather better context. Evaluated on over 9,000 real-world bug reports from six large-scale Java projects, GenLoc outperforms five state-of-the-art bug localization techniques across multiple metrics, with an average improvement of more than 60% in Accuracy@1.

📝 Abstract

Information Retrieval-based Bug Localization aims to identify buggy source files for a given bug report. While existing approaches -- ranging from vector space models to deep learning models -- have shown potential in this domain, their effectiveness is often limited by the vocabulary mismatch between bug reports and source code. To address this issue, we propose a novel Large Language Model (LLM) based bug localization approach, called GenLoc. Given a bug report, GenLoc leverages an LLM equipped with code-exploration functions to iteratively analyze the code base and identify potential buggy files. To gather better context, GenLoc may optionally retrieve semantically relevant files using vector embeddings. GenLoc has been evaluated on over 9,000 real-world bug reports from six large-scale Java projects. Experimental results show that GenLoc outperforms five state-of-the-art bug localization techniques across multiple metrics, achieving an average improvement of more than 60% in Accuracy@1.
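The headline result is reported in Accuracy@1. As a rough illustration, the sketch below computes Accuracy@k under the definition commonly used in bug localization work: the fraction of bug reports for which at least one truly buggy file appears among the top k ranked files. That the paper uses exactly this definition is an assumption, and the function name, bug IDs, and file names are hypothetical example data, not results from the paper.

```python
def accuracy_at_k(rankings, ground_truth, k=1):
    """Fraction of bug reports with at least one buggy file in the top k.

    rankings:     report id -> list of file paths, best candidate first
    ground_truth: report id -> set of file paths actually changed by the fix
    """
    if not rankings:
        return 0.0
    hits = 0
    for report_id, ranked_files in rankings.items():
        buggy = ground_truth.get(report_id, set())
        # A report counts as a hit if any top-k file is truly buggy.
        if any(path in buggy for path in ranked_files[:k]):
            hits += 1
    return hits / len(rankings)


# Hypothetical rankings produced by some localizer for four bug reports.
rankings = {
    "BUG-1": ["Parser.java", "Lexer.java"],
    "BUG-2": ["Cache.java", "Store.java"],
    "BUG-3": ["Router.java", "Handler.java"],
    "BUG-4": ["Util.java", "Config.java"],
}
ground_truth = {
    "BUG-1": {"Parser.java"},
    "BUG-2": {"Store.java"},
    "BUG-3": {"Handler.java", "Router.java"},
    "BUG-4": {"Main.java"},
}

print(accuracy_at_k(rankings, ground_truth, k=1))  # 0.5
print(accuracy_at_k(rankings, ground_truth, k=2))  # 0.75
```

With k=1 only the top-ranked file counts, so a technique that places the right file second gets no credit; this is why Accuracy@1 is the strictest of the usual top-k metrics.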

Problem

Research questions and friction points this paper is trying to address.

Addressing vocabulary mismatch in bug localization

Improving buggy file identification using LLMs

Enhancing accuracy in retrieval-based bug localization

Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses a large language model (LLM) for bug localization

Iteratively analyzes code with code-exploration functions

Retrieves relevant files using vector embeddings
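The three points above can be sketched as a small control loop. This is a minimal illustration of the general pattern only, not GenLoc's implementation: the toy code base, the function names (`retrieve`, `list_files`, `read_file`, `localize`), the bag-of-words stand-in for vector embeddings, and the scripted policy that takes the place of a real LLM are all assumptions made so the example runs on its own.

```python
import math
import re
from collections import Counter

# Toy code base: file path -> contents (hypothetical example data).
CODE_BASE = {
    "src/Cache.java": "class Cache { void evict() { /* remove stale entries */ } }",
    "src/Parser.java": "class Parser { Node parse(String text) { return null; } }",
    "src/Session.java": "class Session { void expire() { cache.evict(); } }",
}

def tokens(text):
    return re.findall(r"[a-z]+", text.lower())

def embed(text):
    """Stand-in for a real embedding model: a bag-of-words count vector."""
    return Counter(tokens(text))

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(bug_report, top_n=2):
    """Optional step: shortlist the files most similar to the bug report."""
    query = embed(bug_report)
    ranked = sorted(CODE_BASE, key=lambda p: cosine(query, embed(CODE_BASE[p])), reverse=True)
    return ranked[:top_n]

# Code-exploration functions exposed to the model (names are assumptions).
def list_files():
    return sorted(CODE_BASE)

def read_file(path):
    return CODE_BASE.get(path, "")

TOOLS = {"list_files": list_files, "read_file": read_file}

def scripted_model(bug_report, candidates, history):
    """Placeholder for the LLM: read each candidate once, then rank them."""
    contents = {args[0]: result for name, args, result in history if name == "read_file"}
    unread = [p for p in candidates if p not in contents]
    if unread:
        return {"action": "read_file", "args": [unread[0]]}
    words = set(tokens(bug_report))
    # Rank by word overlap; a real LLM would reason about the code instead.
    ranked = sorted(candidates, key=lambda p: len(words & set(tokens(contents[p]))), reverse=True)
    return {"action": "answer", "files": ranked}

def localize(bug_report, use_retrieval=True, max_steps=10):
    """Iterate: ask the model for an action, run the tool, feed back the result."""
    candidates = retrieve(bug_report) if use_retrieval else list_files()
    history = []
    for _ in range(max_steps):
        decision = scripted_model(bug_report, candidates, history)
        if decision["action"] == "answer":
            return decision["files"]
        result = TOOLS[decision["action"]](*decision["args"])
        history.append((decision["action"], decision["args"], result))
    return candidates  # step budget exhausted: fall back to the initial order

report = "Stale entries are not removed when the cache should evict them"
print(localize(report)[0])  # src/Cache.java
```

In a real system the `scripted_model` function would be replaced by a call to an LLM that chooses which exploration function to invoke next, and `embed` by an actual embedding model. The step budget (`max_steps`) is one plausible way to bound the iteration; the source does not say how GenLoc terminates.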
