🤖 AI Summary
This work addresses the critical debugging task of automatically extracting fault-triggering inputs from software bug reports. We conduct the first systematic evaluation of three open-source generative large language models (LLMs)—LLaMA, Qwen, and Qwen-Coder—using a manually annotated dataset of 206 real-world defect reports and a prompt-engineering–driven generative approach. Results reveal significant limitations in current open-source LLMs: unstable extraction accuracy and substantial contextual misinterpretation, particularly in identifying precise input configurations that trigger faults. Our contributions include (1) establishing the first dedicated benchmark framework for fault-triggering input extraction; (2) empirically demonstrating that model architecture and code-awareness capabilities critically influence performance; and (3) providing reproducible evidence and concrete directions for improving LLM-based automated debugging. The findings underscore the need for enhanced code understanding and contextual grounding in LLMs to support reliable, production-ready debugging assistance.
📝 Abstract
Failure-inducing inputs play a crucial role in diagnosing and analyzing software bugs. Bug reports typically contain these inputs, and developers extract them to facilitate debugging. Since bug reports are written in natural language, prior research has leveraged various Natural Language Processing (NLP) techniques for automated input extraction. With the advent of Large Language Models (LLMs), an important research question arises: how effectively can generative LLMs extract failure-inducing inputs from bug reports? In this paper, we propose LLPut, a technique for empirically evaluating the performance of three open-source generative LLMs -- LLaMA, Qwen, and Qwen-Coder -- in extracting failure-inducing inputs from bug reports. We conduct an experimental evaluation on a dataset of 206 bug reports to assess the accuracy and effectiveness of these models. Our findings provide insights into the capabilities and limitations of generative LLMs in automated bug diagnosis.
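The prompt-based extraction pipeline the abstract describes might be sketched as follows. Everything here is an illustrative assumption rather than LLPut's actual design: the prompt wording, the `generate` callback (which would wrap a real LLaMA or Qwen model in practice), the `NONE` sentinel for reports without an input, and the stub model used for demonstration.

```python
from typing import Callable, Optional

# Hypothetical prompt template; the paper's actual prompt engineering may differ.
EXTRACTION_PROMPT = (
    "Below is a software bug report. Extract the exact input that triggers "
    "the reported failure. Reply with only that input, or NONE if the report "
    "contains no failure-inducing input.\n\n"
    "Bug report:\n{report}"
)

def extract_failure_input(report: str,
                          generate: Callable[[str], str]) -> Optional[str]:
    """Ask a generative LLM (via `generate`) for the failure-inducing input."""
    answer = generate(EXTRACTION_PROMPT.format(report=report)).strip()
    return None if answer == "NONE" else answer

# Stub standing in for an open-source LLM (LLaMA/Qwen) for demonstration only.
def stub_model(prompt: str) -> str:
    if "gif2png" in prompt:
        return "gif2png --strip corrupt.gif"
    return "NONE"

report = "Running `gif2png --strip corrupt.gif` crashes with a segfault."
print(extract_failure_input(report, stub_model))
# -> gif2png --strip corrupt.gif
```

In a real evaluation, `generate` would call a local inference backend, and the extracted string would be compared against the manually annotated ground truth for each of the 206 bug reports.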