LLPut: Investigating Large Language Models for Bug Report-Based Input Generation

📅 2025-03-26
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the critical debugging task of automatically extracting fault-triggering inputs from software bug reports. We conduct the first systematic evaluation of three open-source generative large language models (LLMs)—LLaMA, Qwen, and Qwen-Coder—using a manually annotated dataset of 206 real-world defect reports and a prompt-engineering–driven generative approach. Results reveal significant limitations in current open-source LLMs: unstable extraction accuracy and substantial contextual misinterpretation, particularly in identifying precise input configurations that trigger faults. Our contributions include (1) establishing the first dedicated benchmark framework for fault-triggering input extraction; (2) empirically demonstrating that model architecture and code-awareness capabilities critically influence performance; and (3) providing reproducible evidence and concrete directions for improving LLM-based automated debugging. The findings underscore the need for enhanced code understanding and contextual grounding in LLMs to support reliable, production-ready debugging assistance.

📝 Abstract
Failure-inducing inputs play a crucial role in diagnosing and analyzing software bugs. Bug reports typically contain these inputs, which developers extract to facilitate debugging. Since bug reports are written in natural language, prior research has leveraged various Natural Language Processing (NLP) techniques for automated input extraction. With the advent of Large Language Models (LLMs), an important research question arises: how effectively can generative LLMs extract failure-inducing inputs from bug reports? In this paper, we propose LLPut, a technique to empirically evaluate the performance of three open-source generative LLMs -- LLaMA, Qwen, and Qwen-Coder -- in extracting relevant inputs from bug reports. We conduct an experimental evaluation on a dataset of 206 bug reports to assess the accuracy and effectiveness of these models. Our findings provide insights into the capabilities and limitations of generative LLMs in automated bug diagnosis.
Problem

Research questions and friction points this paper is trying to address.

Evaluate LLMs for extracting failure-inducing inputs from bug reports
Assess accuracy of LLaMA, Qwen, Qwen-Coder in bug report analysis
Investigate generative LLMs' role in automated bug diagnosis
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses LLaMA, Qwen, Qwen-Coder for input extraction
Evaluates LLMs on 206 bug reports
Assesses accuracy of generative LLMs