🤖 AI Summary
To address precise fault localization (FL) in code by large language models (LLMs) without an execution environment or any annotated data, this paper proposes BAP, the first attention-based self-supervised probe for FL. Methodologically, BAP models defect-sensitive patterns through attention distributions, leveraging self-supervised contrastive learning, lightweight model distillation, and generalization across multi-language bug data, thereby eliminating reliance on line-level labels or runtime feedback. Evaluated on eight benchmark datasets, BAP improves top-1 localization accuracy by an average of 34.6% over the strongest baseline and outperforms zero-shot prompting of GPT-4o by 93.4%, while reducing computational overhead by one to two orders of magnitude. This work establishes, for the first time, the feasibility and advantages of a purely attention-driven, self-supervised FL paradigm.
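The summary describes pooling a model's attention distributions into line-level defect scores. The sketch below is only a rough illustration of that idea, not the paper's actual probe (which is learned): it takes a hypothetical token-level attention matrix, sums the attention each token receives, and averages per source line to rank lines by suspiciousness. The function name, aggregation rule, and normalization are all assumptions for illustration.

```python
import random

def line_scores(attn, token_lines, num_lines):
    """Pool a token-level attention matrix into per-line suspiciousness.

    attn: T x T list of lists; attn[i][j] is attention from token i to token j
    token_lines: source-line index for each of the T tokens
    """
    T = len(attn)
    # total attention each token receives from all tokens
    received = [sum(attn[i][j] for i in range(T)) for j in range(T)]
    scores = [0.0] * num_lines
    counts = [0] * num_lines
    for tok, ln in enumerate(token_lines):
        scores[ln] += received[tok]
        counts[ln] += 1
    # normalize by token count so longer lines are not automatically suspicious
    return [s / c if c else 0.0 for s, c in zip(scores, counts)]

# toy usage: 6 tokens spread over 3 source lines, random attention weights
random.seed(0)
attn = [[random.random() for _ in range(6)] for _ in range(6)]
attn = [[w / sum(row) for w in row] for row in attn]  # rows sum to 1, like softmax
token_lines = [0, 0, 1, 1, 2, 2]
scores = line_scores(attn, token_lines, 3)
top1 = max(range(3), key=scores.__getitem__)  # line ranked most suspicious
```

In the actual method a trained probe replaces this fixed aggregation, but the input/output shape is the same: attention in, a ranking over lines out.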
📝 Abstract
Ensuring code correctness remains a challenging problem even as large language models (LLMs) become increasingly capable at code-related tasks. While LLM-based program repair systems can propose bug fixes using only a user's bug report, their effectiveness is fundamentally limited by their ability to perform fault localization (FL), a challenging problem for both humans and LLMs. Existing FL approaches rely on executable test cases, require training on costly and often noisy line-level annotations, or demand resource-intensive LLMs. In this paper, we present Bug Attention Probe (BAP), a method that learns state-of-the-art fault localization without any direct localization labels, outperforming traditional FL baselines and prompting of large-scale LLMs. We evaluate our approach across a variety of code settings, including real-world Java bugs from the standard Defects4J dataset as well as seven other datasets spanning a diverse set of bug types and languages. Averaged across all eight datasets, BAP improves top-1 accuracy by 34.6% over the strongest baseline and by 93.4% over zero-shot prompting of GPT-4o. BAP is also significantly more efficient than prompting, outperforming large open-weight models at a small fraction of the computational cost.