🤖 AI Summary
This study addresses the inconsistent quality of quality engineering (QE) artifacts—such as requirements specifications, test cases, and Behavior-Driven Development (BDD) scenarios—automatically generated by large language models (LLMs). We propose an iterative optimization framework integrating forward generation, backward generation, and rubric-guided scoring to enhance artifact quality along four dimensions: clarity, completeness, consistency, and testability. Our approach enables automated, quantitative, and reproducible quality assessment and improvement. Evaluated across 12 real-world projects, the method significantly improves output stability: it preserves high quality when inputs are already strong and substantially outperforms baselines when inputs are weak. The core contribution is the first integration of backward generation with structured rubric-based guidance, establishing a closed-loop, artifact-centric quality enhancement paradigm for QE.
📝 Abstract
Large Language Models (LLMs) are transforming Quality Engineering (QE) by automating the generation of artefacts such as requirements, test cases, and Behavior-Driven Development (BDD) scenarios. However, ensuring the quality of these outputs remains a challenge. This paper presents a systematic technique to baseline and evaluate QE artefacts using quantifiable metrics. The approach combines LLM-driven generation, reverse generation, and iterative refinement guided by rubrics for clarity, completeness, consistency, and testability. Experimental results across 12 projects show that reverse-generated artefacts can outperform low-quality inputs and maintain high standards when inputs are strong. The framework enables scalable, reliable QE artefact validation, bridging automation with accountability.
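The closed loop described above—generate, score against a rubric, and refine until all dimensions pass—can be sketched as follows. This is a minimal illustration, not the paper's implementation: `score_artifact` and `refine` are hypothetical stand-ins for the LLM-driven scoring and forward/reverse generation steps, and the threshold and heuristic values are assumed for the example.

```python
# Illustrative sketch of rubric-guided iterative refinement of a QE artifact.
# The scorer and refiner below are placeholder heuristics; in the actual
# framework both roles would be played by LLM calls.

RUBRIC_DIMENSIONS = ("clarity", "completeness", "consistency", "testability")


def score_artifact(artifact: str) -> dict:
    """Hypothetical rubric scorer: assigns each dimension a score in [0, 1].

    Here a trivial heuristic (does the BDD scenario have Given/Then steps?)
    stands in for an LLM grading pass.
    """
    has_steps = "Given" in artifact and "Then" in artifact
    base = 0.9 if has_steps else 0.4
    return {dim: base for dim in RUBRIC_DIMENSIONS}


def refine(artifact: str) -> str:
    """Hypothetical refinement step (stand-in for forward + reverse generation)."""
    if "Given" not in artifact:
        artifact = "Given a known precondition\n" + artifact
    if "Then" not in artifact:
        artifact = artifact + "\nThen an observable outcome is verified"
    return artifact


def optimize(artifact: str, threshold: float = 0.8, max_iters: int = 5):
    """Iterate score -> refine until every rubric dimension clears the threshold."""
    scores = score_artifact(artifact)
    for _ in range(max_iters):
        if min(scores.values()) >= threshold:
            break
        artifact = refine(artifact)
        scores = score_artifact(artifact)
    return artifact, scores
```

For example, a scenario missing its preconditions and assertions (`"When the user logs in"`) fails the initial rubric pass, is wrapped with Given/Then steps by the refiner, and then clears the threshold on the next scoring pass.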