A Survey on Evaluating Large Language Models in Code Generation Tasks

📅 2024-08-29
🏛️ arXiv.org
📈 Citations: 6
Influential: 1
🤖 AI Summary
Current LLM code generation evaluation frameworks suffer from narrow, static metrics that poorly reflect real-world software development practices. To address this, we propose the first multidimensional evaluation framework integrating compilation success rate, unit test pass rate, and runtime efficiency—augmented with practical dimensions including code readability and human-centered feedback. Through systematic literature review, benchmark comparison, expert evaluation, and user experience analysis, we identify critical limitations of prevailing datasets in dynamic, evolution-oriented development scenarios. We introduce the “development-evolution alignment” paradigm—a novel dynamic evaluation approach grounded in iterative, realistic coding workflows. Our comprehensive assessment guidelines span functional correctness, practical utility, and human factors, providing both theoretical foundations and actionable methodologies for standardizing LLM code capability evaluation and guiding model refinement.
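To make the framework's automatable dimensions concrete, here is a minimal Python sketch that scores a single generated candidate on compilation success, unit test pass rate, and runtime. The task format, the `solution` entry-point name, and the flat result record are assumptions for illustration, not the survey's actual harness.

```python
# Minimal sketch of the three automatable dimensions named above; the
# task format and the "solution" entry point are illustrative assumptions.
import time

def evaluate_candidate(source: str, test_cases: list,
                       entry_point: str = "solution") -> dict:
    """Score one generated program on a list of (args, expected) pairs."""
    record = {"compiles": False, "pass_rate": 0.0, "runtime_s": None}

    # Dimension 1: compilation/interpretation success.
    try:
        code = compile(source, "<generated>", "exec")
    except SyntaxError:
        return record
    record["compiles"] = True

    # Execute the module body to obtain the entry-point function.
    # NOTE: real harnesses sandbox this step; exec on untrusted model
    # output is unsafe outside an isolated environment.
    namespace: dict = {}
    try:
        exec(code, namespace)
        func = namespace[entry_point]
    except Exception:
        return record

    # Dimensions 2 and 3: unit test pass rate and wall-clock runtime.
    passed = 0
    start = time.perf_counter()
    for args, expected in test_cases:
        try:
            if func(*args) == expected:
                passed += 1
        except Exception:
            pass  # a crashing test counts as a failure
    record["runtime_s"] = time.perf_counter() - start
    record["pass_rate"] = passed / len(test_cases) if test_cases else 0.0
    return record

# Example: a trivially correct candidate passes both tests.
snippet = "def solution(a, b):\n    return a + b\n"
print(evaluate_candidate(snippet, [((1, 2), 3), ((0, 0), 0)]))
```

The framework's remaining dimensions, code readability and human-centered feedback, resist this kind of automation and are left to expert review and user studies.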

📝 Abstract
This paper provides a comprehensive review of the current methods and metrics used to evaluate the performance of Large Language Models (LLMs) in code generation tasks. With the rapid growth in demand for automated software development, LLMs have demonstrated significant potential in the field of code generation. The paper begins by reviewing the historical development of LLMs and their applications in code generation. Next, it details various methods and metrics for assessing the code generation capabilities of LLMs, including code correctness, efficiency, readability, and evaluation methods based on expert review and user experience. The paper also evaluates the widely used benchmark datasets, identifying their limitations and proposing directions for future improvements. Specifically, the paper analyzes the performance of code generation models across different tasks by combining multiple evaluation metrics, such as code compilation/interpretation success rates, unit test pass rates, and performance and efficiency metrics, to comprehensively assess the practical application of LLMs in code generation. Finally, the paper discusses the challenges faced in evaluating LLMs in code generation, particularly how to ensure the comprehensiveness and accuracy of evaluation methods and how to adapt to the evolving practices of software development. These analyses and discussions provide valuable insights for further optimizing and improving the application of LLMs in code generation tasks.
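Of the metrics the abstract lists, unit test pass rates are the ones benchmarks such as HumanEval standardize on, usually reported as pass@k via the unbiased estimator of Chen et al. (2021): generate n samples per problem, count the c that pass all tests, and estimate the probability that at least one of k drawn samples is correct. The estimator itself is standard; treating it as representative of the "unit test pass rates" here is a reading of the abstract, not the survey's stated choice.

```python
# Unbiased pass@k estimator (Chen et al., 2021):
#   pass@k = E_problems[ 1 - C(n - c, k) / C(n, k) ]
# where n samples are generated per problem and c of them pass all tests.
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k drawn samples is correct."""
    if n - c < k:  # every size-k draw must contain a correct sample
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

print(pass_at_k(200, 42, 1))   # 0.21  (plain per-sample pass rate)
print(pass_at_k(200, 42, 10))  # ~0.91 (benefit of drawing more samples)
```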
Problem

Research questions and friction points this paper is trying to address.

Evaluating the performance of Large Language Models in code generation.
Identifying limitations in current benchmark datasets for code generation.
Addressing challenges in comprehensive and accurate evaluation methods.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Combines multiple metrics for comprehensive evaluation (see the aggregation sketch after this list)
Analyzes code generation across diverse tasks
Proposes improvements for benchmark datasets
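As one way to realize the multi-metric combination noted above, the per-candidate records from the harness sketched earlier can be folded into a single comparable score. The weights and the runtime normalization below are assumptions for illustration; the survey prescribes no particular aggregation.

```python
# Illustrative aggregation of a per-candidate record (see the harness
# sketch above) into one score. The weights and the runtime-budget
# normalization are assumptions, not the survey's scheme.
def composite_score(record: dict,
                    weights: tuple = (0.2, 0.6, 0.2),
                    runtime_budget_s: float = 1.0) -> float:
    w_compile, w_tests, w_speed = weights
    speed = 0.0
    if record["runtime_s"] is not None:
        # 1.0 when comfortably under budget, shrinking as runtime grows.
        speed = min(1.0, runtime_budget_s / max(record["runtime_s"], 1e-9))
    return (w_compile * float(record["compiles"])
            + w_tests * record["pass_rate"]
            + w_speed * speed)

print(composite_score({"compiles": True, "pass_rate": 0.8, "runtime_s": 0.5}))  # 0.88
```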
👥 Authors
Liguo Chen
Peking University, Beijing, China
Qi Guo
Peking University, Beijing, China
Hongrui Jia
Peking University, Beijing, China
Zhengran Zeng
Peking University
Software Engineering, LLM4Code
Xin Wang
Peking University, Beijing, China
Yijiang Xu
Peking University, Beijing, China
Jian Wu
Tokyo Institute of Technology, Tokyo, Japan
Yidong Wang
Peking University, Beijing, China
Qing Gao
Peking University, Beijing, China
Jindong Wang
Assistant Professor, William & Mary; former Senior Researcher, Microsoft Research
machine learning, transfer learning, large language models, generative AI
Wei Ye
Peking University, Beijing, China
Shikun Zhang
Peking University