An LLM-as-Judge Metric for Bridging the Gap with Human Evaluation in SE Tasks

📅 2025-05-27
📈 Citations: 0
Influential: 0
🤖 AI Summary
Evaluating the correctness of software artifacts (e.g., code, patches, comments) generated by large language models (LLMs) remains challenging: human evaluation is accurate but infeasible at scale, while automated metrics are scalable yet suffer from low fidelity. To address this, we propose SWE-Judge—the first LLM-integrated judging framework tailored for software engineering tasks. Its core innovations are: (1) introducing the *LLM-as-Ensemble-Judge* paradigm; (2) designing five orthogonal judging strategies grounded in distinct correctness dimensions; and (3) incorporating dynamic team selection and relevance-weighted fusion to adaptively aggregate judgments. Evaluated across six benchmarks—including CoNaLa, HumanEval-X, and APPS—SWE-Judge improves correlation with human ratings by 5.9%–183.8% over state-of-the-art automated metrics. Notably, it achieves near-human-level consistency in both code generation and program repair tasks for the first time.
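The dynamic-team-selection and relevance-weighted-fusion idea can be sketched as follows. This is a minimal illustration, not SWE-Judge's actual implementation: the judge scores, relevance weights, and team size below are all hypothetical.

```python
# Sketch of relevance-weighted ensemble fusion over multiple judges.
# Each judge emits a correctness score; a relevance weight reflects how
# well that judge's strategy fits the task at hand. All values here are
# illustrative assumptions, not SWE-Judge's real scoring scheme.

def ensemble_judge(scores, relevances, team_size=3):
    """Select the `team_size` most relevant judges, then return the
    relevance-weighted mean of their correctness scores."""
    # Pair each judge's score with its relevance; keep the top team.
    ranked = sorted(zip(scores, relevances), key=lambda p: p[1], reverse=True)
    team = ranked[:team_size]
    total = sum(r for _, r in team)
    if total == 0:
        return 0.0
    return sum(s * r for s, r in team) / total

# Five hypothetical judges' correctness scores and relevance weights.
scores = [1.0, 0.0, 1.0, 1.0, 0.0]
relevances = [0.9, 0.2, 0.8, 0.7, 0.1]
print(ensemble_judge(scores, relevances))  # 1.0 (top-3 judges all agree)
```

The key design point is that low-relevance judges are dropped before fusion, so a strategy that is a poor fit for the current task cannot drag the final correctness score toward noise.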

📝 Abstract
Large Language Models (LLMs) and other automated techniques have been increasingly used to support software developers by generating software artifacts such as code snippets, patches, and comments. However, accurately assessing the correctness of these generated artifacts remains a significant challenge. On one hand, human evaluation provides high accuracy but is labor-intensive and lacks scalability. On the other hand, existing automatic evaluation metrics are scalable and require minimal human effort, but they often fail to accurately reflect the actual correctness of generated software artifacts. In this paper, we present SWE-Judge, the first LLM-as-Ensemble-Judge evaluation metric specifically designed to accurately assess the correctness of generated software artifacts. SWE-Judge first defines five distinct evaluation strategies, each implemented as an independent judge. A dynamic team selection mechanism then identifies the most appropriate subset of judges to produce a final correctness score through ensembling. We evaluate SWE-Judge across a diverse set of software engineering (SE) benchmarks, including CoNaLa, Card2Code, HumanEval-X, APPS, APR-Assess, and Summary-Assess. These benchmarks span three SE tasks: code generation, automated program repair, and code summarization. Experimental results demonstrate that SWE-Judge consistently achieves a higher correlation with human judgments, with improvements ranging from 5.9% to 183.8% over existing automatic metrics. Furthermore, SWE-Judge reaches agreement levels with human annotators that are comparable to inter-annotator agreement in code generation and program repair tasks. These findings underscore SWE-Judge's potential as a scalable and reliable alternative to human evaluation.
Problem

Research questions and friction points this paper is trying to address.

Accurately assessing the correctness of LLM-generated software artifacts
Bridging the gap between human evaluation and automatic metrics
Improving correlation with human judgments in SE tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

LLM-as-Ensemble-Judge paradigm for software artifact evaluation
Dynamic team selection over five independent evaluation strategies
Higher correlation with human judgments than existing automatic metrics
Xin Zhou
Singapore Management University, Singapore
Kisub Kim
Assistant Professor @ DGIST, Korea
AI for Software Engineering · Large Language Models · Software Analytics · Manufacturing AI
Ting Zhang
Singapore Management University, Singapore
Martin Weyssow
Research Scientist, Singapore Management University
Deep Learning for Code · Large Language Models · AI4SE
Luis F. Gomes
Carnegie Mellon University, USA
Guang Yang
Nanjing University of Aeronautics and Astronautics, China
David Lo
Singapore Management University, Singapore