🤖 AI Summary
This study addresses the high volume of false-positive defect reports in Linux kernel development, which delays the resolution of genuine issues. It presents the first systematic empirical investigation of false-positive kernel bug reports, built on a manually annotated dataset of 2,006 reports. The work analyzes the root causes of these misclassifications and evaluates large language models for automatically identifying false positives; among the prompting strategies tested, retrieval-augmented generation (RAG) performs best. It achieves a recall of 91% and an F1 score of 88% on the false-positive identification task, which points to a practical route to more efficient triage of defect reports in open-source software ecosystems.
📝 Abstract
False-positive bug reports represent a significant yet underexplored challenge in the development and maintenance of the Linux kernel. They occur when correct system behavior is mistakenly flagged as a defect, consuming developer effort without leading to actual code improvements. Such reports can mislead developers, waste debugging resources, and delay the resolution of real bugs. In this paper, we present the first comprehensive empirical study of false-positive bug reports in the Linux kernel. We manually construct a dataset of 2,006 bug reports, comprising 1,509 genuine bugs and 497 false positives, collected from Bugzilla and Syzkaller. Our analysis indicates that false positives demand effort comparable to real bugs, often requiring extended discussions and non-trivial closure time. They occur in several components, especially File Systems and Drivers, mainly due to external dependencies and semantic misunderstandings. To address this challenge, we evaluate large language models (LLMs) for automated mitigation of false-positive bug reports. Among various prompting strategies, retrieval-augmented generation (RAG) performs best, achieving 91% recall and an F1 score of 88%. These findings highlight the non-negligible cost of false-positive bug reports and show the promise of LLMs for more efficient false-positive mitigation in the Linux kernel.
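As a reading aid, the reported figures can be cross-checked with a little arithmetic. The sketch below is not from the paper: it assumes the standard definition F1 = 2PR / (P + R) and treats the rounded 91% recall and 88% F1 as exact, so the precision it derives is only an approximation. The function name `implied_precision` is hypothetical.

```python
def implied_precision(recall: float, f1: float) -> float:
    """Solve F1 = 2PR / (P + R) for precision P, given recall R and F1."""
    denom = 2 * recall - f1
    if denom <= 0:
        raise ValueError("recall and F1 are inconsistent with the F1 definition")
    return f1 * recall / denom


# Dataset composition stated in the abstract.
total_reports = 2006
genuine_bugs = 1509
false_positives = 497
assert genuine_bugs + false_positives == total_reports

# Share of the dataset that is false positives.
fp_share = false_positives / total_reports

# Precision implied by the reported (rounded) recall and F1 for RAG.
precision = implied_precision(recall=0.91, f1=0.88)

print(f"false-positive share: {fp_share:.1%}")
print(f"implied precision:    {precision:.1%}")
```

Under these assumptions, roughly a quarter of the dataset is false positives, and the implied precision comes out a few points below the reported recall. The abstract itself does not state a precision figure, so this value should be read as a derived estimate only.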