Source Code Summarization in the Era of Large Language Models

📅 2024-07-09
🏛️ International Conference on Software Engineering
📈 Citations: 24 (Influential: 2)
🤖 AI Summary
This study investigates the performance of large language models (LLMs) on generating natural language summaries of source code. Methodologically, it systematically evaluates zero-shot, few-shot, chain-of-thought, critique, and expert prompting strategies, alongside decoding parameters (e.g., top_p, temperature), across procedural, object-oriented, and logic programming languages to assess cross-language generalization. Key contributions include: (1) empirical evidence that zero-shot prompting often outperforms more complex prompting techniques; (2) identification of CodeLlama-Instruct-7B's superiority over GPT-4 in summarizing code design rationale and asserting code properties, despite GPT-4's overall robustness; (3) revelation of a significant performance bottleneck for LLMs in summarizing logic programming code; (4) establishment of GPT-4 as the automatic evaluation method most closely aligned with human judgment; (5) a first quantitative characterization of how prompting strategies and decoding parameters systematically influence summary quality; and (6) validation that compact open-source models can surpass larger proprietary ones on specific subtasks of code summarization.

📝 Abstract
To support software developers in understanding and maintaining programs, various automatic (source) code summarization techniques have been proposed to generate a concise natural language summary (i.e., comment) for a given code snippet. Recently, the emergence of large language models (LLMs) has led to a great boost in the performance of code-related tasks. In this paper, we undertake a systematic and comprehensive study on code summarization in the era of LLMs, which covers multiple aspects involved in the workflow of LLM-based code summarization. Specifically, we begin by examining prevalent automated evaluation methods for assessing the quality of summaries generated by LLMs and find that the results of the GPT-4 evaluation method are most closely aligned with human evaluation. Then, we explore the effectiveness of five prompting techniques (zero-shot, few-shot, chain-of-thought, critique, and expert) in adapting LLMs to code summarization tasks. Contrary to expectations, advanced prompting techniques may not outperform simple zero-shot prompting. Next, we investigate the impact of LLMs' model settings (including the top_p and temperature parameters) on the quality of generated summaries. We find that the impact of the two parameters on summary quality varies with the base LLM and programming language, but that their impacts are similar. Moreover, we canvass LLMs' abilities to summarize code snippets in distinct types of programming languages. The results reveal that LLMs perform suboptimally when summarizing code written in logic programming languages compared to other language types (e.g., procedural and object-oriented programming languages). Finally, we unexpectedly find that CodeLlama-Instruct with 7B parameters can outperform the advanced GPT-4 in generating summaries describing code design rationale and asserting code properties. We hope that our findings can provide a comprehensive understanding of code summarization in the era of LLMs.
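As a rough illustration of the study's setup (not the paper's artifact), the sketch below shows how a zero-shot code summarization request and the two swept decoding parameters might be assembled; the prompt wording, function names, and default values are hypothetical:

```python
# Illustrative sketch only: a zero-shot prompt (task instruction plus code,
# no in-context examples) and the decoding parameters the study varies.
# Prompt text and defaults are assumptions, not the paper's exact setup.

def build_zero_shot_prompt(code_snippet: str) -> str:
    """Build a zero-shot summarization prompt: instruction + code, no examples."""
    return (
        "Please generate a short comment in one sentence for the following "
        "function:\n" + code_snippet
    )

def decoding_settings(temperature: float = 0.7, top_p: float = 1.0) -> dict:
    """Collect the two decoding parameters the study sweeps (placeholder defaults)."""
    return {"temperature": temperature, "top_p": top_p}

# Example usage with a toy snippet.
snippet = "def add(a, b):\n    return a + b"
prompt = build_zero_shot_prompt(snippet)
params = decoding_settings(temperature=0.0, top_p=0.9)
```

Few-shot prompting would differ only in prepending worked (code, summary) example pairs before the instruction; the paper finds this often does not beat the plain zero-shot form above.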
Problem

Research questions and friction points this paper is trying to address.

Evaluating automated methods for LLM-generated code summary quality
Exploring prompting techniques' effectiveness for code summarization tasks
Investigating LLM performance across different programming languages
Innovation

Methods, ideas, or system contributions that make the work stand out.

GPT-4 evaluation aligns best with human assessment
Zero-shot prompting outperforms advanced prompting techniques
CodeLlama-Instruct-7B surpasses GPT-4 in summarizing code design rationale and asserting code properties
Weisong Sun
Nanyang Technological University
Trustworthy Intelligent SE (Software Engineering)

Yun Miao
State Key Laboratory for Novel Software Technology, Nanjing University, Nanjing, China

Yuekang Li
Lecturer (Assistant Professor), University of New South Wales
Software Engineering · Software Security · AI Red Teaming

Hongyu Zhang
Chongqing University
Software Engineering · Mining Software Repositories · Data-driven Software Engineering · Software Analytics

Chunrong Fang
Software Institute, Nanjing University
Software Testing · Software Engineering · Computer Science

Yi Liu
College of Computing and Data Science, Nanyang Technological University, Singapore

Gelei Deng
Nanyang Technological University
Cybersecurity · System Security · Robotics Security · AI Security · Software Testing

Yang Liu
College of Computing and Data Science, Nanyang Technological University, Singapore

Zhenyu Chen
State Key Laboratory for Novel Software Technology, Nanjing University, Nanjing, China