A Qualitative Investigation into LLM-Generated Multilingual Code Comments and Automatic Evaluation Metrics

📅 2025-05-21
🤖 AI Summary
This study investigates the performance and evaluation reliability of large language models (LLMs) in multilingual code comment generation across Chinese, Dutch, English, Greek, and Polish. To address the lack of standardized error taxonomies, we construct the first cross-lingual code comment error classification framework—comprising 26 fine-grained error types—via open coding and multilingual expert annotation. We systematically characterize error patterns across five state-of-the-art code LMs and uncover significant impacts of linguistic properties on comment coherence, informativeness, and grammatical correctness. Furthermore, we benchmark automated metrics (BLEU, CodeBLEU, BERTScore) against human expert judgments, revealing substantial score overlap between correct and erroneous comments—demonstrating their unreliability for fine-grained evaluation. As a key contribution, we release a publicly available dataset of 12,500 high-quality, multilingually annotated code-comment pairs, establishing a new benchmark and methodological foundation for multilingual code understanding research.

📝 Abstract
Large Language Models are essential coding assistants, yet their training is predominantly English-centric. In this study, we evaluate the performance of code language models in non-English contexts, identifying challenges in their adoption and integration into multilingual workflows. We conduct an open-coding study to analyze errors in code comments generated by five state-of-the-art code models (CodeGemma, CodeLlama, CodeQwen1.5, GraniteCode, and StarCoder2) across five natural languages: Chinese, Dutch, English, Greek, and Polish. Our study yields a dataset of 12,500 labeled generations, which we publicly release. We then assess the reliability of standard metrics in capturing comment correctness across languages and evaluate their trustworthiness as judgment criteria. Through our open-coding investigation, we identified a taxonomy of 26 distinct error categories in model-generated code comments. These categories highlight variations in language cohesion, informativeness, and syntax adherence across different natural languages. Our analysis shows that, while these models frequently produce partially correct comments, modern neural metrics fail to reliably differentiate meaningful completions from random noise. Notably, the significant score overlap between expert-rated correct and incorrect comments calls into question the effectiveness of these metrics in assessing generated comments.
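The score-overlap problem the abstract describes can be illustrated with a minimal sentence-level BLEU sketch (a plain clipped n-gram implementation with add-one smoothing, written for illustration only; it is not the paper's evaluation setup, and the example comments are hypothetical): a generated comment that is lexically close to the reference but factually wrong can outscore a correct paraphrase that uses different words.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bleu(candidate, reference, max_n=4):
    """Sentence-level BLEU: clipped n-gram precisions (add-one smoothed),
    geometric mean, and a brevity penalty for short candidates."""
    cand, ref = candidate.split(), reference.split()
    precisions = []
    for n in range(1, max_n + 1):
        cand_counts = Counter(ngrams(cand, n))
        ref_counts = Counter(ngrams(ref, n))
        # Clip each candidate n-gram count by its count in the reference.
        overlap = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        total = max(sum(cand_counts.values()), 1)
        # Add-one smoothing so short comments don't zero out the geometric mean.
        precisions.append((overlap + 1) / (total + 1))
    bp = 1.0 if len(cand) >= len(ref) else math.exp(1 - len(ref) / max(len(cand), 1))
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)

reference = "returns the sum of the two arguments"
wrong_but_similar = "returns the sum of the two strings"   # factually incorrect
correct_paraphrase = "computes the total of both inputs"   # correct, reworded
```

Here `bleu(wrong_but_similar, reference)` comes out far higher than `bleu(correct_paraphrase, reference)`, even though only the paraphrase is correct; this is the kind of failure mode that motivates benchmarking such metrics against human expert judgments.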
Problem

Research questions and friction points this paper is trying to address.

Evaluating code language models in non-English multilingual contexts
Analyzing errors in LLM-generated code comments across five languages
Assessing reliability of standard metrics for comment correctness evaluation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Constructs the first cross-lingual taxonomy of code comment errors (26 fine-grained categories) via open coding and multilingual expert annotation
Releases a dataset of 12,500 labeled multilingual code-comment generations
Benchmarks BLEU, CodeBLEU, and BERTScore against human expert judgments of comment correctness
Jonathan Katzy
Delft University of Technology, Delft, The Netherlands
Yongcheng Huang
Delft University of Technology, Delft, The Netherlands
Gopal-Raj Panchu
Delft University of Technology, Delft, The Netherlands
Maksym Ziemlewski
Delft University of Technology, Delft, The Netherlands
Paris Loizides
Delft University of Technology, Delft, The Netherlands
Sander Vermeulen
Delft University of Technology, Delft, The Netherlands
Arie van Deursen
Professor of Software Engineering, Delft University of Technology
Software engineering, software testing, empirical software engineering, domain-specific languages, artificial intelligence
Maliheh Izadi
Assistant Professor @ Delft University of Technology, The Netherlands
Software engineering, Evaluation, AI4SE, LLM4Code, Agents