🤖 AI Summary
Large language models (LLMs) increasingly generate substantial software artifacts, yet scalable, reliable, and fine-grained quality assessment methods remain lacking. Existing automated metrics suffer from semantic shallowness, while human evaluation is inherently unscalable.
Method: This paper proposes an LLM-as-a-Judge–based automated evaluation framework for software engineering. Grounded in a systematic literature review and gap analysis, it designs a robust, multi-dimensional intelligent judging system—covering functional correctness, maintainability, security, and more—and outlines a technical roadmap through 2030.
Contribution/Results: We present the first theoretical framework for LLM-as-a-Judge tailored to software engineering, formally defining evaluation dimensions, trustworthiness calibration mechanisms, and standardization pathways. This establishes a methodological foundation and practical guidance for automated, interpretable, and reproducible quality assessment of LLM-generated artifacts.
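The multi-dimensional judging idea described above can be sketched as a small, model-agnostic harness. Everything here is an illustrative assumption, not the paper's actual protocol: the rubric dimension names, the prompt wording, the JSON reply format, and the injected `call_llm` callable are all hypothetical.

```python
import json

# Hypothetical rubric dimensions, echoing those named in the summary.
RUBRIC = ("functional_correctness", "maintainability", "security")

def judge_artifact(code: str, call_llm) -> dict:
    """Score a code artifact on each rubric dimension (1 = worst, 5 = best).

    `call_llm` is any callable mapping a prompt string to the model's raw
    text reply; it is injected so the sketch stays model-agnostic.
    """
    prompt = (
        "Rate the following code from 1 to 5 on each dimension: "
        + ", ".join(RUBRIC)
        + ". Reply with a JSON object only.\n\n"
        + code
    )
    raw = json.loads(call_llm(prompt))
    scores = {dim: int(raw[dim]) for dim in RUBRIC}
    # Minimal trustworthiness check: reject out-of-range scores rather
    # than silently trusting the judge's reply.
    if not all(1 <= s <= 5 for s in scores.values()):
        raise ValueError(f"judge returned out-of-range scores: {scores}")
    return scores

# Usage with a stubbed model reply (a real system would call an LLM API):
stub = lambda _prompt: (
    '{"functional_correctness": 4, "maintainability": 3, "security": 5}'
)
scores = judge_artifact("def add(a, b):\n    return a + b", stub)
```

Injecting the model as a callable keeps the evaluation logic (prompt construction, parsing, range validation) testable without any API access.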
📝 Abstract
The rapid integration of Large Language Models (LLMs) into software engineering (SE) has revolutionized tasks like code generation, producing a massive volume of software artifacts. This surge has exposed a critical bottleneck: the lack of scalable, reliable methods to evaluate these outputs. Human evaluation is costly and time-consuming, while traditional automated metrics like BLEU fail to capture nuanced quality aspects. In response, the LLM-as-a-Judge paradigm, which uses LLMs as automated evaluators, has emerged. This approach leverages the advanced reasoning of LLMs, offering a path toward human-like nuance at automated scale. However, LLM-as-a-Judge research in SE is still in its early stages. This forward-looking SE 2030 paper aims to steer the community toward advancing LLM-as-a-Judge for evaluating LLM-generated software artifacts. We provide a literature review of existing SE studies, analyze their limitations, identify key research gaps, and outline a detailed roadmap. We envision these frameworks as reliable, robust, and scalable human surrogates capable of consistent, multi-faceted artifact evaluation by 2030. Our work aims to foster research and adoption of LLM-as-a-Judge frameworks, ultimately improving the scalability of software artifact evaluation.
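A toy computation can show why n-gram overlap metrics such as BLEU fail on code. The sketch below is a simplified single-n precision, not a full BLEU implementation (no brevity penalty, no multi-n geometric mean); the two token-level snippets are hypothetical examples of functionally identical code written differently.

```python
from collections import Counter

def ngram_precision(candidate: str, reference: str, n: int = 1) -> float:
    """Fraction of candidate n-grams that also appear in the reference.

    A deliberately simplified BLEU-style clipped precision, for
    illustration only.
    """
    cand = candidate.split()
    ref = reference.split()
    cand_ngrams = Counter(tuple(cand[i:i + n]) for i in range(len(cand) - n + 1))
    ref_ngrams = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
    overlap = sum(min(c, ref_ngrams[g]) for g, c in cand_ngrams.items())
    return overlap / max(sum(cand_ngrams.values()), 1)

# Two functionally identical ways to sum a list, pre-tokenized with spaces:
reference = "total = 0 for x in xs : total += x"
candidate = "total = sum ( xs )"
score = ngram_precision(candidate, reference)  # low despite identical behavior
```

Here only half of the candidate's tokens (`total`, `=`, `xs`) overlap with the reference, so the metric penalizes the `sum(xs)` variant even though both programs compute the same result; an LLM judge can instead reason about the behavior.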