From Code to Courtroom: LLMs as the New Software Judges

📅 2025-03-04
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study addresses the lack of efficient, multidimensional, and scalable quality-assessment methods for software artifacts (e.g., code) generated by large language models (LLMs). To overcome the limitations of traditional automated metrics (e.g., BLEU), which fail to capture readability, practicality, and other software-engineering-relevant dimensions, the authors examine the "LLM-as-a-Judge" paradigm, in which an LLM itself scores generated artifacts. The work establishes the first systematic research blueprint for LLM-based judging in software engineering, integrating code understanding, logical reasoning, and human preference alignment. It identifies critical gaps, including reliability assurance, cross-task generalization, and evaluation consistency, and articulates a 2030 vision roadmap covering evaluation framework design, robustness validation, and standardization pathways. The resulting framework constitutes a structured, reproducible foundation for LLM-as-a-Judge evaluation tailored to software engineering, giving the SE community a principled methodology for rigorously assessing generative LLM outputs.

📝 Abstract
Recently, Large Language Models (LLMs) have been increasingly used to automate software engineering (SE) tasks such as code generation and summarization. However, evaluating the quality of LLM-generated software artifacts remains challenging. Human evaluation, while effective, is very costly and time-consuming. Traditional automated metrics like BLEU rely on high-quality references and struggle to capture nuanced aspects of software quality, such as readability and usefulness. In response, the LLM-as-a-Judge paradigm, which employs LLMs for automated evaluation, has emerged. Given that LLMs are typically trained to align with human judgment and possess strong coding abilities and reasoning skills, they hold promise as cost-effective and scalable surrogates for human evaluators. Nevertheless, LLM-as-a-Judge research in the SE community is still in its early stages, with many breakthroughs needed. This forward-looking SE 2030 paper aims to steer the research community toward advancing LLM-as-a-Judge for evaluating LLM-generated software artifacts, while also sharing potential research paths to achieve this goal. We provide a literature review of existing SE studies on LLM-as-a-Judge and envision these frameworks as reliable, robust, and scalable human surrogates capable of evaluating software artifacts with consistent, multi-faceted assessments by 2030 and beyond. To validate this vision, we analyze the limitations of current studies, identify key research gaps, and outline a detailed roadmap to guide future developments of LLM-as-a-Judge in software engineering. While not intended to be a definitive guide, our work aims to foster further research and adoption of LLM-as-a-Judge frameworks within the SE community, ultimately improving the effectiveness and scalability of software artifact evaluation methods.
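To make the paradigm concrete: an LLM-as-a-Judge pipeline typically builds a rubric prompt around the artifact under evaluation, sends it to a judge model, and parses per-criterion scores from the reply. The sketch below is a minimal, hypothetical illustration (the rubric wording, criteria, and the idea of parsing "criterion: score" lines are assumptions, not the paper's method); the LLM API call is mocked out, since any provider could fill that slot.

```python
import re

# Hypothetical rubric covering the quality dimensions the paper highlights
# (readability, usefulness) alongside correctness.
RUBRIC = """You are a code reviewer. Rate the following code on a 1-5 scale
for each criterion: correctness, readability, usefulness.
Reply with one line per criterion, e.g. "readability: 4".

Code:
{code}
"""

def build_judge_prompt(code: str) -> str:
    # Fill the rubric template with the artifact under evaluation.
    return RUBRIC.format(code=code)

def parse_scores(reply: str) -> dict:
    # Extract "criterion: score" pairs from the judge's free-text reply.
    return {m.group(1).lower(): int(m.group(2))
            for m in re.finditer(r"(\w+)\s*:\s*([1-5])\b", reply)}

# A real system would send build_judge_prompt(...) to an LLM here;
# we use a mock reply to show the parsing step.
mock_reply = "correctness: 5\nreadability: 4\nusefulness: 3"
print(parse_scores(mock_reply))  # {'correctness': 5, 'readability': 4, 'usefulness': 3}
```

Structured, per-criterion scores like these are what make LLM judges comparable across artifacts; the paper's roadmap concerns making such scores reliable and consistent, which this sketch does not address.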
Problem

Research questions and friction points this paper is trying to address.

Challenges in evaluating LLM-generated software artifacts
Need for cost-effective and scalable evaluation methods
Advancing LLM-as-a-Judge frameworks for software quality assessment
Innovation

Methods, ideas, or system contributions that make the work stand out.

LLMs automate the evaluation of software artifacts.
LLM-as-a-Judge positions LLMs as cost-effective, scalable surrogates for human evaluators.
Roadmap toward reliable, robust, and scalable LLM-based evaluation frameworks by 2030.