An LLM-as-Judge Metric for Bridging the Gap with Human Evaluation in SE Tasks

📅 2025-05-27
📈 Citations: 0
Influential: 0
🤖 AI Summary
Evaluating the correctness of software artifacts (e.g., code, patches, comments) generated by large language models (LLMs) remains challenging: human evaluation is accurate but infeasible at scale, while automated metrics are scalable yet suffer from low fidelity. To address this, we propose SWE-Judge—the first LLM-integrated judging framework tailored for software engineering tasks. Its core innovations are: (1) introducing the *LLM-as-Ensemble-Judge* paradigm; (2) designing five orthogonal judging strategies grounded in distinct correctness dimensions; and (3) incorporating dynamic team selection and relevance-weighted fusion to adaptively aggregate judgments. Evaluated across six benchmarks—including CoNaLa, HumanEval-X, and APPS—SWE-Judge improves correlation with human ratings by 5.9%–183.8% over state-of-the-art automated metrics. Notably, it achieves near-human-level consistency in both code generation and program repair tasks for the first time.
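The dynamic-team-selection and relevance-weighted-fusion idea can be sketched as follows. This is a minimal illustration, not SWE-Judge's actual implementation: the judge scores, relevance weights, and team size below are all hypothetical.

```python
# Sketch of relevance-weighted ensemble fusion over multiple judges.
# Each judge emits a correctness score; a relevance weight reflects how
# well that judge's strategy fits the task at hand. All values here are
# illustrative assumptions, not SWE-Judge's real scoring scheme.

def ensemble_judge(scores, relevances, team_size=3):
    """Select the `team_size` most relevant judges, then return the
    relevance-weighted mean of their correctness scores."""
    # Pair each judge's score with its relevance; keep the top team.
    ranked = sorted(zip(scores, relevances), key=lambda p: p[1], reverse=True)
    team = ranked[:team_size]
    total = sum(r for _, r in team)
    if total == 0:
        return 0.0
    return sum(s * r for s, r in team) / total

# Five hypothetical judges' correctness scores and relevance weights.
scores = [1.0, 0.0, 1.0, 1.0, 0.0]
relevances = [0.9, 0.2, 0.8, 0.7, 0.1]
print(ensemble_judge(scores, relevances))  # 1.0 (top-3 judges all agree)
```

The key design point is that low-relevance judges are dropped before fusion, so a strategy that is a poor fit for the current task cannot drag the final correctness score toward noise.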

📝 Abstract
Large Language Models (LLMs) and other automated techniques have been increasingly used to support software developers by generating software artifacts such as code snippets, patches, and comments. However, accurately assessing the correctness of these generated artifacts remains a significant challenge. On one hand, human evaluation provides high accuracy but is labor-intensive and lacks scalability. On the other hand, existing automatic evaluation metrics are scalable and require minimal human effort, but they often fail to accurately reflect the actual correctness of generated software artifacts. In this paper, we present SWE-Judge, the first LLM-as-Ensemble-Judge evaluation metric specifically designed to accurately assess the correctness of generated software artifacts. SWE-Judge first defines five distinct evaluation strategies, each implemented as an independent judge. A dynamic team selection mechanism then identifies the most appropriate subset of judges to produce a final correctness score through ensembling. We evaluate SWE-Judge across a diverse set of software engineering (SE) benchmarks, including CoNaLa, Card2Code, HumanEval-X, APPS, APR-Assess, and Summary-Assess. These benchmarks span three SE tasks: code generation, automated program repair, and code summarization. Experimental results demonstrate that SWE-Judge consistently achieves a higher correlation with human judgments, with improvements ranging from 5.9% to 183.8% over existing automatic metrics. Furthermore, SWE-Judge reaches agreement levels with human annotators that are comparable to inter-annotator agreement in code generation and program repair tasks. These findings underscore SWE-Judge's potential as a scalable and reliable alternative to human evaluation.
Problem

Research questions and friction points this paper is trying to address.

Accurately assessing the correctness of LLM-generated software artifacts
Bridging the gap between human evaluation and automatic metrics
Improving correlation with human judgments in SE tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

LLM-as-Ensemble-Judge paradigm for software artifact evaluation
Dynamic team selection over five independent evaluation strategies
Higher correlation with human judgments than existing automatic metrics
Xin Zhou
Singapore Management University, Singapore
Kisub Kim
Assistant Professor @ DGIST, Korea
AI for Software Engineering · Large Language Models · Software Analytics · Manufacturing AI
Ting Zhang
Singapore Management University, Singapore
Martin Weyssow
Research Scientist, Singapore Management University
Deep Learning for Code · Large Language Models · AI4SE
Luis F. Gomes
Carnegie Mellon University, USA
Guang Yang
Nanjing University of Aeronautics and Astronautics, China
David Lo
Singapore Management University, Singapore