🤖 AI Summary
This study investigates how well large language models (LLMs) summarize source code in natural language. It systematically evaluates five prompting techniques (zero-shot, few-shot, chain-of-thought, critique, and expert) and two decoding parameters (top_p and temperature) across procedural, object-oriented, and logic programming languages. Key findings include: (1) advanced prompting techniques may not outperform simple zero-shot prompting; (2) CodeLlama-Instruct with 7B parameters can outperform GPT-4 on summaries that describe code design rationale and assert code properties, showing that a compact open-source model can surpass a larger proprietary one on specific summary types; (3) LLMs perform noticeably worse when summarizing code written in logic programming languages than code in other language types; (4) among the automated evaluation methods examined, GPT-4-based evaluation aligns most closely with human evaluation; and (5) the effect of top_p and temperature on summary quality varies with the base LLM and the programming language, although the two parameters have similar effects.
📝 Abstract
To support software developers in understanding and maintaining programs, various automatic (source) code summarization techniques have been proposed to generate a concise natural language summary (i.e., comment) for a given code snippet. Recently, the emergence of large language models (LLMs) has led to a great boost in the performance of code-related tasks. In this paper, we undertake a systematic and comprehensive study on code summarization in the era of LLMs, which covers multiple aspects involved in the workflow of LLM-based code summarization. Specifically, we begin by examining prevalent automated evaluation methods for assessing the quality of summaries generated by LLMs and find that the results of the GPT-4 evaluation method are most closely aligned with human evaluation. Then, we explore the effectiveness of five prompting techniques (zero-shot, few-shot, chain-of-thought, critique, and expert) in adapting LLMs to code summarization tasks. Contrary to expectations, advanced prompting techniques may not outperform simple zero-shot prompting. Next, we investigate the impact of LLMs' model settings (including the top_p and temperature parameters) on the quality of generated summaries. We find that the impact of the two parameters on summary quality varies by the base LLM and programming language, but their impacts are similar. Moreover, we canvass LLMs' abilities to summarize code snippets in distinct types of programming languages. The results reveal that LLMs perform suboptimally when summarizing code written in logic programming languages compared to other language types (e.g., procedural and object-oriented programming languages). Finally, we unexpectedly find that CodeLlama-Instruct with 7B parameters can outperform the more advanced GPT-4 in generating summaries describing code design rationale and asserting code properties. We hope that our findings can provide a comprehensive understanding of code summarization in the era of LLMs.
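For readers unfamiliar with the two model settings named in the abstract, the following is a minimal sketch of how temperature and top_p are commonly defined in LLM decoding. It is not taken from the paper, and it does not reproduce the paper's experimental setup; the function names `apply_temperature` and `top_p_filter` and the toy logits are illustrative assumptions.

```python
import math


def apply_temperature(logits, temperature):
    """Convert raw logits to probabilities after dividing by the temperature.

    Hypothetical helper for illustration: lower temperatures sharpen the
    distribution, higher temperatures flatten it.
    """
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_p_filter(probs, top_p):
    """Keep the smallest set of most likely tokens whose total mass reaches top_p.

    Hypothetical helper for illustration: tokens outside that set get
    probability 0, and the kept probabilities are renormalized to sum to 1.
    """
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break
    filtered = [0.0] * len(probs)
    for i in kept:
        filtered[i] = probs[i] / mass
    return filtered


if __name__ == "__main__":
    toy_logits = [2.0, 1.0, 0.5, -1.0]  # made-up scores for four candidate tokens
    for temperature in (0.2, 1.0):
        probs = apply_temperature(toy_logits, temperature)
        print(temperature, [round(p, 3) for p in top_p_filter(probs, 0.9)])
```

Both settings narrow or widen the set of tokens the model is likely to sample, which is one plausible reading of why the abstract reports that their impacts on summary quality are similar.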