Assessing and Advancing Benchmarks for Evaluating Large Language Models in Software Engineering Tasks

📅 2025-05-13
📈 Citations: 0
Influential citations: 0
🤖 AI Summary
This paper addresses systemic deficiencies—such as data leakage, human evaluation bias, and task oversimplification—in current large language model (LLM) benchmarks for software engineering (SE). We conduct the first large-scale, integrative meta-analysis of SE-LLM evaluation practices. Specifically, we systematically survey 191 SE-domain LLM benchmarks and propose a three-dimensional evaluation framework assessing task coverage, data quality, and evaluation validity. Leveraging bibliometric analysis, task taxonomy modeling, and empirical capability comparisons, we identify critical limitations across existing benchmarks and articulate an evolutionary pathway toward high-fidelity, scenario-aware, and dynamically adaptive benchmarking. Our findings culminate in a comprehensive SE-LLM benchmarking roadmap, which has been formally adopted as a design guideline by leading open-source evaluation initiatives, including CodeTrust and SEBench.
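
As a rough illustration of what assessing a benchmark along the three dimensions named above (task coverage, data quality, evaluation validity) could look like in practice, the sketch below scores benchmarks with a simple weighted aggregate. The data structure, field names, weights, and scoring scheme are hypothetical illustrations, not taken from the paper.

```python
from dataclasses import dataclass

# Hypothetical illustration only: the paper names three assessment dimensions
# (task coverage, data quality, evaluation validity) but does not prescribe
# this data structure, these weights, or this scoring scheme.

@dataclass
class BenchmarkAssessment:
    name: str
    task_coverage: float        # fraction of surveyed SE tasks the benchmark exercises, in [0, 1]
    data_quality: float         # e.g. penalized for data leakage or stale samples, in [0, 1]
    evaluation_validity: float  # e.g. penalized for biased human judging or trivial metrics, in [0, 1]

    def overall(self, weights=(1 / 3, 1 / 3, 1 / 3)) -> float:
        """Weighted aggregate of the three dimensions (equal weights are an assumption)."""
        w_cov, w_dat, w_val = weights
        return (w_cov * self.task_coverage
                + w_dat * self.data_quality
                + w_val * self.evaluation_validity)


# Example: compare two fictitious benchmarks along the three axes.
if __name__ == "__main__":
    a = BenchmarkAssessment("bench-A", task_coverage=0.6, data_quality=0.4, evaluation_validity=0.7)
    b = BenchmarkAssessment("bench-B", task_coverage=0.3, data_quality=0.9, evaluation_validity=0.8)
    for bench in (a, b):
        print(f"{bench.name}: overall={bench.overall():.2f}")
```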

📝 Abstract
Large language models (LLMs) are gaining increasing popularity in software engineering (SE) due to their unprecedented performance across various applications. These models are increasingly being utilized for a range of SE tasks, including requirements engineering and design, code analysis and generation, software maintenance, and quality assurance. As LLMs become more integral to SE, evaluating their effectiveness is crucial for understanding their potential in this field. In recent years, substantial efforts have been made to assess LLM performance in various SE tasks, resulting in the creation of several benchmarks tailored to this purpose. This paper offers a thorough review of 191 benchmarks, addressing three main aspects: what benchmarks are available, how benchmarks are constructed, and the future outlook for these benchmarks. We begin by examining SE tasks such as requirements engineering and design, coding assistance, software testing, AIOps, software maintenance, and quality management. We then analyze the benchmarks and their development processes, highlighting the limitations of existing benchmarks. Additionally, we discuss the successes and failures of LLMs in different SE tasks and explore future opportunities and challenges for SE-related benchmarks. We aim to provide a comprehensive overview of benchmark research in SE and offer insights to support the creation of more effective evaluation tools.
Problem

Research questions and friction points this paper is trying to address.

Evaluating the effectiveness of LLMs in software engineering tasks
Reviewing existing benchmarks for LLM performance in SE
Identifying limitations and future challenges for SE benchmarks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Reviewing 191 benchmarks for LLM evaluation
Analyzing SE tasks and benchmark limitations
Exploring future challenges in SE benchmarks
🔎 Similar Papers
No similar papers found.
Xing Hu
School of Software Technology, Zhejiang University, Ningbo, Zhejiang, China

Feifei Niu
University of Ottawa
software engineering, empirical software engineering, requirements engineering

Junkai Chen
Zhejiang University, Hangzhou, China

Xin Zhou
School of Computing and Information Systems, Singapore Management University, Singapore

Junwei Zhang
Zhejiang University, Hangzhou, China

Junda He
Singapore Management University
software engineering

Xin Xia
Zhejiang University, Hangzhou, Zhejiang, China

David Lo
School of Computing and Information Systems, Singapore Management University, Singapore