Using Large Language Models to Document Code: A First Quantitative and Qualitative Assessment

📅 2024-08-26
🏛️ arXiv.org
📈 Citations: 2
Influential: 0
🤖 AI Summary
Existing evaluation methods for LLM-generated code comments rely on small-scale datasets and inadequate IR metrics (e.g., BLEU), failing to capture semantic fidelity. Method: We systematically assess GPT-3.5’s Javadoc generation for 23,850 Java code snippets, employing a dual-dimensional evaluation combining quantitative BLEU scoring with qualitative expert human assessment. Contribution/Results: Our study reveals a critical flaw in BLEU: high scores frequently correlate with low-quality, verbatim descriptions, while high-fidelity semantic paraphrasing is systematically penalized. We find that 69.7% of generated Javadocs are semantically equivalent to—or can be refined to match—the original quality, and 22.4% significantly surpass the originals. These results demonstrate that automated metrics alone are unreliable for assessing documentation quality. We advocate human evaluation as the gold standard, with BLEU serving only as a supplementary heuristic—establishing a new, more rigorous paradigm for evaluating code documentation generation.

📝 Abstract
Code documentation is vital for software development, improving readability and comprehension. However, it is often skipped due to its labor-intensive nature. Large Language Models present an opportunity to automate the generation of code documentation, easing the burden on developers. While recent studies have explored the use of such models for code documentation, most rely on quantitative metrics such as BLEU to assess the quality of the generated comments. Yet the applicability and accuracy of these metrics in this scenario remain uncertain. In this paper, we leveraged OpenAI GPT-3.5 to regenerate the Javadoc of 23,850 code snippets containing methods and classes. We conducted both quantitative and qualitative assessments, employing BLEU alongside human evaluation, to assess the quality of the generated comments. Our key findings reveal that: (i) in our qualitative analyses, when the documentation generated by GPT was compared with the original, 69.7% was considered equivalent (45.7%) or required only minor changes to be equivalent (24.0%); (ii) 22.4% of the generated comments were rated as having higher quality than the originals; (iii) quantitative metrics are susceptible to inconsistencies; for example, comments perceived as having higher quality were unjustly penalized by BLEU.
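The BLEU inconsistency the paper reports is easy to reproduce in miniature: a comment that copies the reference verbatim scores perfectly, while a faithful paraphrase with little n-gram overlap scores near zero. The sketch below uses a simplified sentence-level BLEU (modified n-gram precision with a brevity penalty), not the paper's exact setup, and the example Javadoc sentences are invented for illustration.

```python
from collections import Counter
import math

def bleu(candidate: str, reference: str, max_n: int = 4) -> float:
    """Simplified sentence-level BLEU: geometric mean of modified
    n-gram precisions (n = 1..max_n) times a brevity penalty.
    Illustrative only -- real implementations add smoothing."""
    cand, ref = candidate.split(), reference.split()
    precisions = []
    for n in range(1, max_n + 1):
        cand_ngrams = Counter(tuple(cand[i:i + n]) for i in range(len(cand) - n + 1))
        ref_ngrams = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
        overlap = sum((cand_ngrams & ref_ngrams).values())  # clipped matches
        total = max(sum(cand_ngrams.values()), 1)
        precisions.append(overlap / total)
    if min(precisions) == 0:  # any empty n-gram overlap zeroes the score
        return 0.0
    log_avg = sum(math.log(p) for p in precisions) / max_n
    bp = 1.0 if len(cand) >= len(ref) else math.exp(1 - len(ref) / max(len(cand), 1))
    return bp * math.exp(log_avg)

# Hypothetical Javadoc one-liners, not taken from the paper's dataset.
reference  = "returns the number of elements in this list"
verbatim   = "returns the number of elements in this list"
paraphrase = "gives back how many items the list currently holds"

print(bleu(verbatim, reference))    # → 1.0 (perfect n-gram overlap)
print(bleu(paraphrase, reference))  # → 0.0 (same meaning, almost no shared n-grams)
```

The paraphrase is semantically equivalent to the reference, yet BLEU assigns it the minimum score because it shares no bigrams with the original, which is exactly the kind of penalization of high-quality comments that motivates the authors' call for human evaluation.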
Problem

Research questions and friction points this paper is trying to address.

Evaluates AI-generated code comment quality versus human-written ones
Identifies limitations of traditional metrics in assessing documentation quality
Explores relationship between code properties and AI comment effectiveness
Innovation

Methods, ideas, or system contributions that make the work stand out.

Large-scale empirical study with post-training code
Qualitative expert assessment of AI-generated comments
Correlation analysis between code properties and comment quality