Synthline: A Product Line Approach for Synthetic Requirements Engineering Data Generation using Large Language Models

📅 2025-05-06

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Problem: The scarcity of high-quality labeled data in requirements engineering (RE) hinders the application of NLP and machine learning techniques. Method: This paper applies product line engineering (PLE) principles to synthetic data generation for RE, proposing Synthline, a feature-modeling-based framework intended to be controllable, reusable, and scalable. It combines large language model (LLM) prompting with systematic feature modeling to specify what data to generate (in the evaluated use case, requirements specification defects), and it assesses the synthetic data along two dimensions: diversity and utility for training downstream classifiers. Contribution/Results: Synthetic datasets are less diverse than real data but still serve as viable training resources. Hybrid training (real + synthetic) achieves up to an 85% improvement in precision and a 2x increase in recall compared with models trained only on real data. The implementation and generated datasets are publicly available to support reproducible research in RE.

📝 Abstract

While modern Requirements Engineering (RE) heavily relies on natural language processing and Machine Learning (ML) techniques, their effectiveness is limited by the scarcity of high-quality datasets. This paper introduces Synthline, a Product Line (PL) approach that leverages Large Language Models to systematically generate synthetic RE data for classification-based use cases. Through an empirical evaluation conducted in the context of using ML for the identification of requirements specification defects, we investigated both the diversity of the generated data and its utility for training downstream models. Our analysis reveals that while synthetic datasets exhibit less diversity than real data, they are good enough to serve as viable training resources. Moreover, our evaluation shows that combining synthetic and real data leads to substantial performance improvements. Specifically, hybrid approaches achieve up to 85% improvement in precision and a 2x increase in recall compared to models trained exclusively on real data. These findings demonstrate the potential of PL-based synthetic data generation to address data scarcity in RE. We make both our implementation and generated datasets publicly available to support reproducibility and advancement in the field.
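To make the reported gains concrete, the sketch below computes precision and recall for two sets of predictions and expresses the change relative to a real-only baseline. It assumes the abstract's "85% improvement" is relative to that baseline rather than an absolute score. The labels and predictions are invented for illustration and are not results from the paper, and the function names are hypothetical.

```python
def precision_recall(y_true, y_pred, positive=1):
    """Return (precision, recall) for the given positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def relative_improvement(baseline, candidate):
    """Fractional change of candidate over baseline (1.0 means +100%)."""
    return (candidate - baseline) / baseline


# Illustrative toy data only: 1 = defective requirement, 0 = clean requirement.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
real_only_pred = [1, 0, 0, 0, 1, 0, 0, 0]  # hypothetical model trained on real data
hybrid_pred = [1, 1, 0, 0, 0, 0, 0, 0]     # hypothetical model trained on real + synthetic

base_p, base_r = precision_recall(y_true, real_only_pred)
hyb_p, hyb_r = precision_recall(y_true, hybrid_pred)

print(f"precision: {base_p:.2f} -> {hyb_p:.2f} "
      f"({relative_improvement(base_p, hyb_p):+.0%})")
print(f"recall:    {base_r:.2f} -> {hyb_r:.2f} ({hyb_r / base_r:.1f}x)")
```

A "2x increase in recall" corresponds to a recall ratio of 2.0, which is the same as a 100% relative improvement. Neither figure says what the absolute recall is.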

Problem

Research questions and friction points this paper is trying to address.

Addressing scarcity of high-quality RE datasets for NLP/ML

Generating synthetic RE data using Large Language Models

Improving ML model performance via hybrid real-synthetic data

Innovation

Methods, ideas, or system contributions that make the work stand out.

Product Line approach for synthetic RE data

Leverages Large Language Models systematically

Combines synthetic and real data for training, improving precision by up to 85% and recall by 2x over real-only training
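The product line idea in the points above can be sketched as a small feature model whose valid configurations each yield one generation prompt. This is a minimal illustration, not Synthline's actual implementation: the feature names, their values, and the prompt wording are all assumptions, and no LLM is called.

```python
from itertools import product

# Hypothetical feature model: each feature offers mutually exclusive alternatives.
# These names and values are illustrative and not taken from the paper.
FEATURE_MODEL = {
    "defect_type": ["ambiguity", "incompleteness"],
    "domain": ["automotive", "healthcare"],
    "specification_format": ["user story", "shall statement"],
}


def enumerate_configurations(feature_model):
    """Yield every configuration: one selected value per feature."""
    names = list(feature_model)
    for values in product(*(feature_model[name] for name in names)):
        yield dict(zip(names, values))


def render_prompt(config):
    """Turn one configuration into a generation prompt (wording is an assumption)."""
    return (
        f"Write one {config['specification_format']} requirement "
        f"for the {config['domain']} domain that contains "
        f"the defect '{config['defect_type']}'."
    )


configs = list(enumerate_configurations(FEATURE_MODEL))
prompts = [render_prompt(config) for config in configs]

print(len(prompts), "prompts generated")
print(prompts[0])
```

Because every prompt is derived from an explicit configuration, the label (here the defect type) is known before generation, and coverage of the feature space can be checked by counting configurations rather than by inspecting the generated text.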
