Assessing and Advancing Benchmarks for Evaluating Large Language Models in Software Engineering Tasks

📅 2025-05-13
📈 Citations: 0
Influential citations: 0
🤖 AI Summary
This paper addresses systemic deficiencies—such as data leakage, human evaluation bias, and task oversimplification—in current large language model (LLM) benchmarks for software engineering (SE). We conduct the first large-scale, integrative meta-analysis of SE-LLM evaluation practices. Specifically, we systematically survey 191 SE-domain LLM benchmarks and propose a three-dimensional evaluation framework assessing task coverage, data quality, and evaluation validity. Leveraging bibliometric analysis, task taxonomy modeling, and empirical capability comparisons, we identify critical limitations across existing benchmarks and articulate an evolutionary pathway toward high-fidelity, scenario-aware, and dynamically adaptive benchmarking. Our findings culminate in a comprehensive SE-LLM benchmarking roadmap, which has been formally adopted as a design guideline by leading open-source evaluation initiatives, including CodeTrust and SEBench.
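
As a rough illustration of what assessing a benchmark along the three dimensions named above (task coverage, data quality, evaluation validity) could look like in practice, the sketch below scores benchmarks with a simple weighted aggregate. The data structure, field names, weights, and scoring scheme are hypothetical illustrations, not taken from the paper.

```python
from dataclasses import dataclass

# Hypothetical illustration only: the paper names three assessment dimensions
# (task coverage, data quality, evaluation validity) but does not prescribe
# this data structure, these weights, or this scoring scheme.

@dataclass
class BenchmarkAssessment:
    name: str
    task_coverage: float        # fraction of surveyed SE tasks the benchmark exercises, in [0, 1]
    data_quality: float         # e.g. penalized for data leakage or stale samples, in [0, 1]
    evaluation_validity: float  # e.g. penalized for biased human judging or trivial metrics, in [0, 1]

    def overall(self, weights=(1 / 3, 1 / 3, 1 / 3)) -> float:
        """Weighted aggregate of the three dimensions (equal weights are an assumption)."""
        w_cov, w_dat, w_val = weights
        return (w_cov * self.task_coverage
                + w_dat * self.data_quality
                + w_val * self.evaluation_validity)


# Example: compare two fictitious benchmarks along the three axes.
if __name__ == "__main__":
    a = BenchmarkAssessment("bench-A", task_coverage=0.6, data_quality=0.4, evaluation_validity=0.7)
    b = BenchmarkAssessment("bench-B", task_coverage=0.3, data_quality=0.9, evaluation_validity=0.8)
    for bench in (a, b):
        print(f"{bench.name}: overall={bench.overall():.2f}")
```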

📝 Abstract
Large language models (LLMs) are gaining increasing popularity in software engineering (SE) due to their unprecedented performance across various applications. These models are increasingly being utilized for a range of SE tasks, including requirements engineering and design, code analysis and generation, software maintenance, and quality assurance. As LLMs become more integral to SE, evaluating their effectiveness is crucial for understanding their potential in this field. In recent years, substantial efforts have been made to assess LLM performance in various SE tasks, resulting in the creation of several benchmarks tailored to this purpose. This paper offers a thorough review of 191 benchmarks, addressing three main aspects: what benchmarks are available, how benchmarks are constructed, and the future outlook for these benchmarks. We begin by examining SE tasks such as requirements engineering and design, coding assistance, software testing, AIOps, software maintenance, and quality management. We then analyze the benchmarks and their development processes, highlighting the limitations of existing benchmarks. Additionally, we discuss the successes and failures of LLMs in different SE tasks and explore future opportunities and challenges for SE-related benchmarks. We aim to provide a comprehensive overview of benchmark research in SE and offer insights to support the creation of more effective evaluation tools.
Problem

Research questions and friction points this paper is trying to address.

Evaluating the effectiveness of LLMs in software engineering tasks
Reviewing existing benchmarks for LLM performance in SE
Identifying limitations and future challenges for SE benchmarks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Reviewing 191 benchmarks for LLM evaluation
Analyzing SE tasks and benchmark limitations
Exploring future challenges in SE benchmarks
🔎 Similar Papers
No similar papers found.
Xing Hu
School of Software Technology, Zhejiang University, Ningbo, Zhejiang, China

Feifei Niu
University of Ottawa
software engineering, empirical software engineering, requirements engineering

Junkai Chen
Zhejiang University, Hangzhou, China

Xin Zhou
School of Computing and Information Systems, Singapore Management University, Singapore

Junwei Zhang
Zhejiang University, Hangzhou, China

Junda He
Singapore Management University
software engineering

Xin Xia
Zhejiang University, Hangzhou, Zhejiang, China

David Lo
School of Computing and Information Systems, Singapore Management University, Singapore