Investigating the Use of LLMs for Evidence Briefings Generation in Software Engineering

📅 2025-07-21
📈 Citations: 0
Influential: 0
🤖 AI Summary
This registered report investigates the feasibility and quality of using Retrieval-Augmented Generation (RAG)-enhanced large language models (LLMs) to automatically generate evidence briefings in software engineering, evaluated along three dimensions: content fidelity, ease of understanding, and usefulness, as compared with manually produced briefings. Methodologically, it applies RAG to this domain through an end-to-end tool that synthesizes and structures findings from secondary studies into briefing form, and it specifies a controlled experiment in which researchers and practitioners rate LLM-generated and human-made briefings along those three dimensions. Its key contributions are: (1) an experimental protocol for evaluating LLM-generated briefings with both researchers and practitioners; and (2) once the trials are run, empirical evidence characterizing the relative strengths and limitations of LLM-generated briefings. Because this is a registered report, results and conclusions are still pending; the motivating promise is a scalable, automated pathway for transferring research evidence to software engineering practice.
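To make the described pipeline concrete, here is a minimal sketch of a RAG-style briefing generator. It is not the authors' implementation: the paper does not specify the retriever or the LLM client, so this sketch uses TF-IDF retrieval, a stubbed `call_llm`, and illustrative chunk sizes and briefing section headings, all of which are assumptions.

```python
# Minimal RAG-style briefing generator (illustrative sketch, not the authors' tool).
# Retrieval is TF-IDF cosine similarity; `call_llm` is a stub to swap for a real
# chat-completion client. Chunk size and section headings are assumptions.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


def chunk(text: str, size: int = 800) -> list[str]:
    """Split the secondary study's full text into fixed-size chunks."""
    return [text[i:i + size] for i in range(0, len(text), size)]


def retrieve(query: str, chunks: list[str], k: int = 3) -> list[str]:
    """Return the k chunks most similar to the query (TF-IDF cosine)."""
    vec = TfidfVectorizer().fit(chunks + [query])
    sims = cosine_similarity(vec.transform([query]), vec.transform(chunks))[0]
    return [chunks[i] for i in sims.argsort()[::-1][:k]]


def call_llm(prompt: str) -> str:
    """Stub: replace with a real LLM API call in an actual tool."""
    return f"[LLM answer to: {prompt[:60]}...]"


def generate_briefing(study_text: str) -> str:
    """Fill a fixed evidence-briefing template, one retrieved context per section."""
    sections = {
        "Main findings": "What are the main findings of this study?",
        "Who is this briefing for?": "Who is the intended audience?",
        "Where do the findings come from?": "Which studies were synthesized?",
    }
    chunks = chunk(study_text)
    parts = []
    for heading, question in sections.items():
        context = "\n".join(retrieve(question, chunks))
        prompt = (f"Using only the context below, answer concisely.\n"
                  f"Context:\n{context}\n\nQuestion: {question}")
        parts.append(f"{heading}\n{call_llm(prompt)}")
    return "\n\n".join(parts)
```

Grounding each section's prompt in retrieved passages from the source study is what would let such a tool constrain the LLM to the study's actual findings, which is exactly the content-fidelity concern the experiment sets out to measure.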

📝 Abstract
[Context] An evidence briefing is a concise and objective transfer medium that can present the main findings of a study to software engineers in the industry. Although practitioners and researchers have deemed evidence briefings useful, their production requires manual labor, which may be a significant challenge to their broad adoption. [Goal] The goal of this registered report is to describe an experimental protocol for evaluating LLM-generated evidence briefings for secondary studies in terms of content fidelity, ease of understanding, and usefulness, as perceived by researchers and practitioners, compared to human-made briefings. [Method] We developed a RAG-based LLM tool to generate evidence briefings. We used the tool to automatically generate briefings for two secondary studies whose evidence briefings had been manually produced in previous research efforts. We designed a controlled experiment to evaluate how the LLM-generated briefings compare to the human-made ones regarding perceived content fidelity, ease of understanding, and usefulness. [Results] To be reported after the experimental trials. [Conclusion] Depending on the experiment results.
Problem

Research questions and friction points this paper is trying to address.

Evaluating LLM-generated evidence briefings for software engineering studies
Comparing LLM and human-made briefings on fidelity and usefulness
Assessing automated briefing generation to reduce manual labor
Innovation

Methods, ideas, or system contributions that make the work stand out.

RAG-based LLM tool for briefing generation
Automated evidence briefings from secondary studies
Controlled experiment comparing LLM and human-made briefings (analysis sketch below)
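The planned comparison lends itself to nonparametric tests over ordinal ratings. The sketch below is a hypothetical analysis, assuming a between-subjects design with 1-to-5 Likert scores per dimension; the registered report's actual design, scales, and statistical tests may differ.

```python
# Hypothetical analysis for the planned experiment: compare LLM-generated vs.
# human-made briefing ratings on one dimension (e.g., perceived fidelity).
# Mann-Whitney U suits ordinal Likert data; the real protocol may specify
# a different design or test.
from scipy.stats import mannwhitneyu


def compare_briefings(llm_scores: list[int], human_scores: list[int]) -> None:
    """Two-sided nonparametric comparison of two independent rating samples."""
    stat, p = mannwhitneyu(llm_scores, human_scores, alternative="two-sided")
    print(f"U={stat:.1f}, p={p:.4f}")


# Made-up example ratings, for illustration only:
compare_briefings([4, 5, 3, 4, 4, 5], [5, 4, 4, 5, 3, 4])
```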